(54) A write buffer for a superpipelined, superscalar microprocessor.



(57) A superscalar, superpipelined microprocessor having a write buffer located between the
central processing unit core and memory cache. The write buffer stores the results of write
operations to memory until the cache memory becomes available, i.e., when no high-priority
reads are to be performed. The write buffer includes multiple entries that are split into two
circular buffer sections for facilitating the interaction with the two core pipelines.
Cross-dependency tables are provided for each write buffer entry to ensure that the data is
written from the write buffer to memory in program order, while considering any prior data in
the opposite section. Non-cacheable reads from memory are also ordered in program order with
the writing of data from the write buffer. Features for performing misaligned writes, handling
speculative execution, detecting and handling data dependencies and exceptions, and performing
gathered writes are also included within the microprocessor.
Background of the Invention

1. Field of the Invention

This invention is in the field of integrated circuits of the microprocessor type, and is more specifically
directed to memory access circuitry in the same.

2. Description of Related Art 

In the field of microprocessors, the number of instructions executed per second is a primary performance
measure. As is well known in the art, many factors in the design and manufacture of a microprocessor impact
this measure. For example, the execution rate depends quite strongly on the clock frequency of the
microprocessor. The frequency of the clock applied to a microprocessor is limited, however, by power dissipation
concerns and by the switching characteristics of the transistors in the microprocessor.

The architecture of the microprocessor is also a significant factor in the execution rate of a microprocessor.
For example, many modern microprocessors utilize a "pipelined" architecture to improve their execution rate
if many of their instructions require multiple clock cycles for execution. According to conventional pipelining
techniques, each microprocessor instruction is segmented into several stages, and separate circuitry is
provided to perform each stage of the instruction. The execution rate of the microprocessor is thus increased by
overlapping the execution of different stages of multiple instructions in each clock cycle. In this way, one
multiple-cycle instruction may be completed in each clock cycle.

By way of further background, some microprocessor architectures are of the "superscalar" type, where 
multiple instructions are issued in each clock cycle for execution in parallel. Assuming no dependencies among
instructions, the increase in instruction throughput is proportional to the degree of scalability. 

Another known technique for improving the execution rate of a microprocessor and the system in which it
is implemented is the use of a cache memory. Conventional cache memories are small high-speed memories
that store program and data from memory locations which are likely to be accessed in performing later
instructions, as determined by a selection algorithm. Since the cache memory can be accessed in a reduced number
of clock cycles (often a single cycle) relative to main system memory, the effective execution rate of a
microprocessor utilizing a cache is much improved over a non-cache system. Many cache memories are located on
the same integrated circuit chip as the microprocessor itself, providing further performance improvement.

According to each of these architecture-related performance improvement techniques, certain events may
occur that slow the microprocessor performance. For example, in both the pipelined and the superscalar
architectures, multiple instructions may require access to the same internal circuitry at the same time, in which
case one of the instructions will have to wait (i.e., "stall") until the priority instruction is serviced by the circuitry.

One type of such a conflict often occurs where one instruction requests a write to memory (including
cache) at the same time that another instruction requests a read from the memory. If the instructions are
serviced on a "first-come-first-served" basis, the later-arriving instruction will have to wait for the completion of a
prior instruction until it is granted memory access. These and other stalls are, of course, detrimental to
microprocessor performance.

It has been discovered that for most instruction sequences (i.e., programs), reads from memory or cache
are generally more time-critical than writes to memory or cache, especially where a large number of
general-purpose registers are provided in the microprocessor architecture. This is because the instructions and input
data are necessary at specific times in the execution of the program in order for the program to execute in an
efficient manner; in contrast, since writes to memory are merely writing the result of the program execution,
the actual time at which the writing occurs is not as critical, since the execution of later instructions may not
depend upon the result.

By way of further background, write buffers have been provided in microprocessors; such write buffers
are logically located between on-chip cache memory and the bus to main memory. These conventional
post-cache write buffers receive data from the cache for a write-through or write-back operation; the contents of
the post-cache write buffer are written to main memory under the control of the bus controller, at times when
the bus becomes available.

By way of further background, many modern microprocessors can access memory locations using
addresses that are not necessarily a multiple of the operand size. An example of a microprocessor type in which
this is the case is that commonly referred to as being "X86" compatible. In these microprocessors, therefore,
some memory writes may include bytes which are outside of the byte block containing the lowest byte address,
such that multiple write cycles are required to accomplish the write operation. These writes are often referred
to as "misaligned writes". Indeed, a significant fraction of memory writes may overlap byte-block boundaries
in many microprocessors.

By way of further background, microprocessors including both an integer central processing unit and a
floating point processing unit are well known. In such microprocessors, the data word width of the integer
results may be smaller than that of the floating point unit; for example, integer data may be thirty-two bits wide
while the floating point data may be sixty-four bits wide. Other processing circuitry, besides floating point units,
may also provide data of wider width than that provided by the central processing unit.

By way of further background, it is important in many microprocessor applications that writes from the
central processing unit core to cache or memory be made in program order, meaning in the order of the instructions
provided by the programmer, to ensure proper program operation. Methods for maintaining the operation of a
single unit in program order are known, for example by provision of a circular buffer and counter. However,
where more than one buffer or operation must be maintained in program order, the use of a single
counter is inadequate.

By way of further background, it is well known for microprocessors of conventional architectures, such as
those having so-called "X86" compatibility, to effect write operations of byte sizes smaller than the capacity
of the internal data bus.

By way of further background, pipelined microprocessors are known to be vulnerable to certain hazards
commonly referred to as data dependencies. In general, data dependencies arise when two instructions at
different stages in the pipeline require access to the same register or memory location, as the pipeline may access
the register or memory location for the later instruction (in program order) before the earlier instruction has
written data thereto, which results in erroneous operation. Techniques for detecting such data dependencies
in conventional pipelined microprocessors are known in the art, as described in Patterson and Hennessy,
Computer Architecture: A Quantitative Approach (Morgan Kaufmann, 1990), pp. 257-78. According to conventional
techniques, detection of a data dependency or hazard is handled by stalling the pipeline until the earlier
instruction (in program order) is completed, after which the later instruction can be processed. Of course, pipeline
stalls result in loss of performance for the microprocessor.

By way of further background, some pipelined architecture microprocessors operate according to
speculative execution in order to keep the pipeline full despite conditional branch or jump instructions being
present in the program sequence. Speculative execution requires that predictive branching be performed, where
the microprocessor predicts whether the conditional branch will be taken or not taken according to an algorithm;
the predicted path is then executed in the pipeline. It is important that the results of speculatively executed
instructions not be written to memory or cache, because if the prediction is incorrect, it may be difficult or
impossible to recover from the incorrectly performed memory write. Further, another type of situation can occur
where instructions are processed in a pipeline, including writes to memory, where an earlier instruction has
an exception condition (e.g., divide-by-zero) for which the program execution should be immediately stopped.


Summary of the Invention 

The present invention relates to a write buffer in a microprocessor, logically located between the core of
the microprocessor and the memory, wherein each write to memory is directed to the write buffer rather than
to a memory bus or cache memory. The contents of the write buffer are then written into cache memory or
main memory in an asynchronous manner, when the memory bus or cache memory is available.

An aspect of the invention includes misaligned write handling capability. If a misaligned write is detected,
a second write buffer entry is allocated with the first write buffer entry, and a higher order address is calculated
and stored therein. Upon retiring of the misaligned write, the data stored in the first write buffer entry is shifted
into the proper byte lanes so that the lower order bytes are at the higher order end and vice versa. A latch is
provided for storage of the shifted data after retiring of the first entry and the lower order data. The latch
presents the higher order data with the address for the second entry, completing the misaligned write.
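
As a rough illustration of why a misaligned write needs two write cycles, the following Python sketch splits one write into per-block cycles, each carrying the bytes placed in their byte lanes. The function name `split_write` and the 8-byte block size are assumptions made for the example; the sketch models only the address and byte-lane arithmetic, not the entry allocation or latch circuitry of the invention.

```python
def split_write(addr: int, data: bytes, block_bytes: int = 8):
    """Split a write into one or two aligned cycles.

    Returns a list of (block_base_address, {byte_lane: byte_value}) tuples.
    block_bytes = 8 is an assumed byte-block size, not taken from the source.
    """
    if not 0 < len(data) <= block_bytes:
        raise ValueError("write must fit within two adjacent blocks")
    offset = addr % block_bytes          # byte lane of the lowest byte
    base = addr - offset                 # aligned address of the first block
    first_len = min(len(data), block_bytes - offset)
    # First cycle: lower order bytes land in the upper lanes of the first block.
    cycles = [(base, {offset + i: data[i] for i in range(first_len)})]
    if first_len < len(data):
        # Misaligned: remaining bytes go to lane 0 upward of the next block,
        # at the higher order address.
        rest = data[first_len:]
        cycles.append((base + block_bytes, {i: rest[i] for i in range(len(rest))}))
    return cycles
```

A write that stays within one block yields a single cycle; a write that crosses the block boundary yields a second cycle at the next block address.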

Another aspect of the invention includes buffering extra wide data without requiring the entire write buffer
to be constructed with the extra width. A secondary processing unit, such as a floating point unit, may produce
results in words that are wider (in bits) than the width of words produced by the central processing unit. A
secondary data latch is provided to store the results of the secondary processing unit, with a control bit being set
when the data therein is valid. A standard write buffer entry is allocated with the physical address corresponding
to the results of the secondary processing unit, with a control bit being set to indicate that its data will be stored
in the secondary data latch. Upon retiring of the write buffer entry, the contents of the secondary data latch
rather than the contents of the write buffer entry will be presented to the cache.
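
The selection between the entry's own data and the secondary data latch can be sketched as follows. This is a minimal Python model under stated assumptions: the names `Entry`, `SecondaryLatch`, `wide`, and `retire` are hypothetical, and returning `None` when the latch is not yet valid is an illustrative choice rather than behavior described in the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    address: int
    data: int = 0          # narrow data held in the entry itself
    wide: bool = False     # control bit: data is held in the secondary latch


@dataclass
class SecondaryLatch:
    data: int = 0
    valid: bool = False    # control bit: latch contents are valid


def retire(entry: Entry, latch: SecondaryLatch):
    """Return the (address, data) pair presented to the cache, or None."""
    if entry.wide:
        if not latch.valid:
            return None            # wide result not yet available
        latch.valid = False        # latch is free again after retiring
        return entry.address, latch.data
    return entry.address, entry.data
```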

Another aspect of the invention ensures that the writing of data from the write buffer is in program order
in the case where the write buffer is split into two sections for layout or operational efficiency. Program order
is maintained by including a cross-dependency field in the write buffer entries that is loaded upon allocation
of each write buffer entry with a map showing which write buffer entries in the opposite section have already
been allocated. The cross-dependency fields in each write buffer entry are cleared, bit by bit, as each write
buffer entry earlier in program order is retired. Retiring of a write buffer entry in program order is ensured by requiring
its cross-dependency field to be clear. Additionally, a similar concept may be used to ensure the performance
of non-cacheable reads in program order with retiring from a write buffer, by providing a cross-dependency
field for the read that is a map of the allocated write buffer entries at the time the read is allocated, with gating
of the operations in similar fashion.
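
The cross-dependency mechanism described above can be sketched in Python as a pair of bitmaps. This is a toy model under stated assumptions: the class name `SplitWriteBuffer`, the section labels, and the default of six entries per section are hypothetical, and ordering within a single section (handled by the circular buffer pointers) is not modeled.

```python
class SplitWriteBuffer:
    """Toy model of a write buffer split into sections "X" and "Y"."""

    def __init__(self, entries_per_section: int = 6):
        self.n = entries_per_section
        # Bitmap of currently allocated entries in each section.
        self.allocated = {"X": 0, "Y": 0}
        # Cross-dependency field for every entry of each section.
        self.xdep = {"X": [0] * self.n, "Y": [0] * self.n}

    @staticmethod
    def _other(section: str) -> str:
        return "Y" if section == "X" else "X"

    def allocate(self, section: str, index: int) -> None:
        # Snapshot which opposite-section entries are already allocated:
        # those writes are earlier in program order.
        self.xdep[section][index] = self.allocated[self._other(section)]
        self.allocated[section] |= 1 << index

    def can_retire(self, section: str, index: int) -> bool:
        is_allocated = (self.allocated[section] >> index) & 1
        return bool(is_allocated) and self.xdep[section][index] == 0

    def retire(self, section: str, index: int) -> None:
        if not self.can_retire(section, index):
            raise RuntimeError("entry still depends on an earlier write")
        self.allocated[section] &= ~(1 << index)
        # Clear this entry's bit in every opposite-section dependency field.
        other = self._other(section)
        for i in range(self.n):
            self.xdep[other][i] &= ~(1 << index)
```

Allocating X0, then Y0, then X1 leaves Y0 waiting on X0 and X1 waiting on Y0, so the three entries can only retire in their allocation order.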

Another aspect of the invention includes provisions for performing gathered writes from the write buffer
to the cache. During allocation of the write buffer entries, comparisons are made between the physical address
of the currently allocated entry and those of previously allocated entries to determine whether, at least, the physical
addresses allocated are within the same byte group, in which case the multiple writes may be gatherable, or mergeable,
into a single write operation to the cache. Other constraints on gatherability can include that the bytes are contiguous with
one another, and that the writes are from adjacent write instructions in program order. Retiring of gatherable
write buffer entries is effected by loading a latch with the data from the write buffer entries, after shifting of
the data to place it in the proper byte lanes; the write is effected by presentation of the address in combination
with the contents of the latch.
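
The gatherability test can be illustrated with a short Python sketch. It assumes an 8-byte group and represents each write as an `(address, data)` pair; the helper names `gatherable` and `gather` are hypothetical, and the program-order adjacency constraint is left to the caller.

```python
def _in_one_group(addr: int, length: int, group_bytes: int) -> bool:
    """True if the bytes [addr, addr + length) lie inside a single byte group."""
    return addr // group_bytes == (addr + length - 1) // group_bytes


def gatherable(a, b, group_bytes: int = 8) -> bool:
    """Check two (address, data) writes for same-group, contiguous bytes."""
    (a_addr, a_data), (b_addr, b_data) = a, b
    same_group = (
        _in_one_group(a_addr, len(a_data), group_bytes)
        and _in_one_group(b_addr, len(b_data), group_bytes)
        and a_addr // group_bytes == b_addr // group_bytes
    )
    contiguous = a_addr + len(a_data) == b_addr or b_addr + len(b_data) == a_addr
    return same_group and contiguous


def gather(a, b, group_bytes: int = 8):
    """Merge two gatherable writes into one (address, data) write."""
    if not gatherable(a, b, group_bytes):
        raise ValueError("writes are not gatherable")
    lo, hi = sorted([a, b], key=lambda w: w[0])
    return lo[0], lo[1] + hi[1]
```

Two adjacent two-byte writes within one group merge into a single four-byte write; writes that straddle a group boundary or leave a gap between them are rejected.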

Another aspect of the invention includes provisions for detecting data hazards or dependencies such as
read-after-write (RAW) dependencies, particularly relative to data already written to the write buffers. Retiring
of write buffer contents is prevented for those entries subject to a RAW dependency, thus avoiding erroneous
reads. Capability may also be provided for sourcing data directly from the write buffer, or even bypassing the
write buffer, to reduce the effect of pipeline stalls due to RAW hazards. Further capability may also be provided
so that only the last of multiple reads may be sourced to the core. Write-after-read control may also be provided
to avoid false RAW hazard detection.
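
At its simplest, RAW detection against the write buffer is an address-range overlap test between a read and each pending write. The Python sketch below shows only that comparison; the function name `raw_hazards` and the `(address, length)` representation of pending writes are assumptions for illustration, and read sourcing and write-after-read control are not modeled.

```python
def raw_hazards(read_addr: int, read_len: int, pending_writes):
    """Return indices of pending writes whose bytes overlap the read.

    pending_writes is a sequence of (address, length) pairs, one per
    allocated write buffer entry that has not yet been retired.
    """
    read_end = read_addr + read_len
    return [
        i
        for i, (w_addr, w_len) in enumerate(pending_writes)
        if w_addr < read_end and read_addr < w_addr + w_len
    ]
```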

Another aspect of the invention includes a speculative execution field of control bits for each write buffer
entry, where writes to the write buffer during speculative execution are allowed. Each control bit corresponds
to a predictive or speculative branch, and is set upon allocation of a write buffer entry according to the degree
of speculation of the write. In the event of a misprediction, each write buffer entry having its speculative control
bit set for the failing prediction is flushed, so that the write buffer entry becomes available for re-allocation.
Exception handling may be accomplished by clearing all write buffer entries that have been allocated but are
not yet retired at the time of the exception. A no-op bit is provided for each write buffer entry to allow the retire
pointers and allocation pointers to match when the buffer is empty.
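
The speculative control bits can be sketched as a small bitmask per entry. In this Python model the class and method names are hypothetical, each bit position stands for one outstanding speculative branch, and the rule that an entry retires only when its field is clear is an assumption drawn from the requirement that speculative results not reach memory or cache before resolution.

```python
class SpecWriteBuffer:
    """Toy model of per-entry speculation control bits."""

    def __init__(self):
        self.entries = {}  # entry id -> bitmask of unresolved branches

    def allocate(self, entry_id: int, outstanding_mask: int) -> None:
        # Set one bit for each prediction still unresolved at allocation time.
        self.entries[entry_id] = outstanding_mask

    def resolve(self, level: int) -> None:
        # Prediction at this level was correct: clear its bit everywhere.
        for entry_id in self.entries:
            self.entries[entry_id] &= ~(1 << level)

    def mispredict(self, level: int):
        # Flush every entry written under the failed prediction.
        flushed = sorted(e for e, m in self.entries.items() if (m >> level) & 1)
        for entry_id in flushed:
            del self.entries[entry_id]
        return flushed

    def retirable(self):
        # Only entries with no unresolved speculation may go to the cache.
        return sorted(e for e, m in self.entries.items() if m == 0)
```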

It is therefore an object of the present invention to provide a microprocessor architecture which buffers 
the writing of data from the CPU core into a write buffer, prior to retiring of the data to a cache, and in which 
misaligned writes may be easily handled with minimal loss of performance. 
It is a further object of the present invention to provide a microprocessor architecture which allows for
storage of execution results in a write buffer prior to retiring data to cache or memory, where the write buffer has
a plurality of locations of a smaller bit width than that provided by a secondary processing unit.

It is a further object of the present invention to provide such an architecture where buffering is provided 
for the results of the secondary processing unit without requiring all write buffer locations to be constructed 
to accommodate the extra bit width.

It is a further object of the present invention to provide a microprocessor architecture which buffers the 
writing of data from the CPU core into a write buffer, prior to retiring of the data to a cache, where the write 
buffer is split into two sections. 

It is a further object of the present invention to provide such an architecture which ensures the retiring of 
data from the write buffer to cache or main memory in program order.

It is a further object of the present invention to provide such an architecture which allows for the performing 
of non-cacheable reads in program order with the retiring of data from the write buffer.

It is a further object of the present invention to provide a microprocessor architecture which allows for
storage of execution results in a write buffer prior to retiring data to cache or memory, and for which the capability
is provided to store the write data from multiple write operations for presentation from the write buffer to the
cache in a single cycle.

It is a further object of the present invention to provide for detection of gatherable, or mergeable, write
operations to enable the gathered write to the cache from the write buffer.

It is a further object of the present invention to provide a microprocessor architecture which allows for
storage of execution results in a write buffer prior to retiring data to cache or memory in a manner in which data
dependencies may be detected.

It is a further object of the present invention to provide for allocation of write buffer locations with an indi- 
cation that an otherwise apparent data dependency is in fact not a data dependency. 
It is a further object of the present invention to provide such an architecture which is implemented in a
superpipelined superscalar microprocessor architecture. 

It is a further object of the present invention to provide a microprocessor architecture which buffers the 
writing of data from the CPU core into a write buffer, prior to retiring of the data to a cache, and in which recovery 
from speculative execution or exceptions can be readily performed.

Other objects and advantages of the present invention will be apparent to those of ordinary skill in the art
having reference to the following specification in combination with the drawings. 

Brief Description of the Drawings 


Figure 1a illustrates a block diagram of the overall microprocessor.

Figure 1b illustrates a generalized block diagram of the instruction pipeline stages. 

Figure 2 illustrates a block diagram of a processor system using the microprocessor. 

Figure 3 illustrates a timing diagram showing the flow of instructions through the pipeline stages.

Figure 4 is an electrical diagram, in block form, of the write buffer in the microprocessor of Figure 1a
according to the preferred embodiment of the invention.

Figure 5 is a representation of the contents of one of the entries in the write buffer of Figure 4. 

Figure 6 is a flow chart illustrating the allocation of a write buffer entry during the address calculation stage
AC2 of the pipeline of Figure 1b.

Figure 7 is a representation of the physical address comparison process in the allocation of Figure 6.

Figure 8 is a map of the address valid bits of the cross-dependency field for a write buffer entry for one 
pipeline of the microprocessor of Figure 1a relative to the address valid bits of the write buffer entries for the 
other pipeline of the microprocessor of Figure 1a. 

Figure 9 is a flow chart illustrating the issuing of a write buffer entry according to the preferred embodiment 
of the invention.

Figure 10 is a flow chart illustrating the retiring of a write buffer entry according to the preferred
embodiment of the invention.

Figure 11 is a flow chart illustrating a method for detecting and handling dependency hazards according
to the preferred embodiment of the invention.

Figures 12a and 12b are flow charts illustrating a method for processing speculative execution and
speculation faults according to the preferred embodiment of the invention.

Figure 13 is a flow chart illustrating a method for handling exceptions according to the preferred embodi- 
ment of the invention. 

Figure 14 is a flow chart illustrating a method for allocating write buffer locations for misaligned write
operations, according to the preferred embodiment of the invention.

Figure 15 is a flow chart illustrating a sequence for retiring write buffer locations for misaligned write op- 
erations, according to the preferred embodiment of the invention. 

Figure 16 is a flow chart illustrating a sequence for retiring write buffer locations for gathered write
operations, according to the preferred embodiment of the invention.

Figure 17 is a representation of a non-cacheable read cross-dependency field as used in the
microprocessor of Figure 1a according to the preferred embodiment of the invention.

Figures 18a and 18b are flow charts illustrating the allocation and retiring sequences, respectively, of a 
non-cacheable read operation according to the preferred embodiment of the invention. 

Detailed Description of the Preferred Embodiment

The detailed description of an exemplary embodiment of the microprocessor of the present invention is 
organized as follows: 

1. Exemplary processor system

2. Generalized pipeline architecture

3. Write buffer architecture and operation

4. Read-after-write hazard detection and write buffer operation

5. Speculative execution and exception handling

6. Special write cycles from the write buffer

7. Conclusion

This organizational table, and the corresponding headings used in this detailed description, are provided
for convenience of reference only. Detailed descriptions of conventional or known aspects of the
microprocessor are omitted so as not to obscure the description of the invention with unnecessary detail.
1. Exemplary Processor System

The exemplary processor system is shown in Figures 1a, 1b, and 2. Figures 1a and 1b respectively
illustrate the basic functional blocks of the exemplary superscalar, superpipelined microprocessor along with
the pipe stages of the two execution pipelines. Figure 2 illustrates an exemplary processor system
(motherboard) design using the microprocessor.

1.1 Microprocessor 

Referring to Figure 1a, the major sub-blocks of a microprocessor 10 include: (a) central processing unit
(CPU) core 20, (b) prefetch buffer 30, (c) prefetcher 35, (d) branch processing unit (BPU) 40, (e) address
translation unit (ATU) 50, and (f) unified 16 Kbyte code/data cache 60, including TAG RAM 62. A 256 byte instruction
line cache 65 provides a primary instruction cache to reduce instruction fetches to the unified cache, which
operates as a secondary instruction cache. An onboard floating point unit (FPU) 70 executes floating point
instructions issued to it by the CPU core 20.

The microprocessor uses internal 32-bit address and 64-bit data buses ADS and DATA. A 256 bit (32 byte) 
prefetch bus (PFB), corresponding to the 32 byte line size of the unified cache 60 and the instruction line cache 
65, allows a full line of 32 instruction bytes to be transferred to the instruction line cache in a single clock. 
Interface to external 32 bit address and 64 bit data buses is through a bus interface unit (BIU). 

The CPU core 20 is a superscalar design with two execution pipes X and Y. It includes an instruction
decoder 21, address calculation units 22X and 22Y, execution units 23X and 23Y, and a register file 24 with 32
32-bit registers. An AC control unit 25 includes a register translation unit 25a with a register scoreboard and
register renaming hardware. A microcontrol unit 26, including a microsequencer and microROM, provides
execution control.

Writes from CPU core 20 are queued into twelve 32 bit write buffers 29; write buffer allocation is
performed by the AC control unit 25. These write buffers provide an interface for writes to the unified cache
60; noncacheable writes go directly from the write buffers to external memory. The write buffer logic supports
optional read sourcing and write gathering.

A pipe control unit 28 controls instruction flow through the execution pipes, including: keeping the
instructions in order until it is determined that an instruction will not cause an exception; squashing bubbles in the
instruction stream; and flushing the execution pipes behind branches that are mispredicted and instructions
that cause an exception. For each stage, the pipe control unit keeps track of which execution pipe contains
the earliest instruction, provides a "stall" output and receives a "delay" input.

BPU 40 predicts the direction of branches (taken or not taken), and provides target addresses for predicted
taken branches and unconditional change of flow instructions (jumps, calls, returns). In addition, it monitors
speculative execution in the case of branches and floating point instructions, i.e., the execution of instructions
speculatively issued after branches which may turn out to be mispredicted, and floating point instructions
issued to the FPU 70 which may fault after the speculatively issued instructions have completed execution. If
a floating point instruction faults, or if a branch is mispredicted (which will not be known until the EX or WB
stage for the branch), then the execution pipeline must be repaired to the point of the faulting or mispredicted
instruction (i.e., the execution pipeline is flushed behind that instruction), and instruction fetch restarted.

Pipeline repair is accomplished by creating checkpoints of the processor state at each pipe stage as a
floating point or predicted branch instruction enters that stage. For these checkpointed instructions, all resources
(programmer visible registers, instruction pointer, condition code register) that can be modified by succeeding
speculatively issued instructions are checkpointed. If a checkpointed floating point instruction faults or a
checkpointed branch is mispredicted, the execution pipeline is flushed behind the checkpointed instruction;
for floating point instructions, this will typically mean flushing the entire execution pipeline, while for a
mispredicted branch there may be a paired instruction in EX and two instructions in WB that would be allowed to
complete.

For the exemplary microprocessor 10, the principal constraints on the degree of speculation are: (a)
speculative execution is allowed for only up to four floating point or branch instructions at a time (i.e., the speculation
level is maximum 4), and (b) a write or floating point store will not complete to the cache or external memory
until the associated branch or floating point instruction has been resolved (i.e., the prediction is correct, or
floating point instruction does not fault).

The unified cache 60 is 4-way set associative (with a 4k set size), using a pseudo-LRU replacement
algorithm, with write-through and write-back modes. It is dual ported (through banking) to permit two memory
accesses (data read, instruction fetch, or data write) per clock. The instruction line cache is a fully associative,
lookaside implementation (relative to the unified cache), using an LRU replacement algorithm.
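
The stated cache parameters imply a particular split of a 32-bit address into tag, set index, and byte offset. The Python sketch below works this out, reading "4k set size" as 4 Kbytes per way (16 Kbytes divided over four ways) and using the 32 byte line size given earlier; the constant and function names are chosen for the example only.

```python
CACHE_BYTES = 16 * 1024   # unified cache size stated in the description
WAYS = 4                  # 4-way set associative
LINE_BYTES = 32           # line size stated in the description

# Number of sets: 16384 / (4 * 32) = 128, i.e. 4 Kbytes per way.
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def split_address(addr: int):
    """Return (tag, set_index, byte_offset) for a 32-bit address."""
    offset = addr % LINE_BYTES               # low 5 bits
    index = (addr // LINE_BYTES) % SETS      # next 7 bits
    tag = addr // (LINE_BYTES * SETS)        # remaining upper 20 bits
    return tag, index, offset
```

Under this reading, the offset and index together occupy the low 12 address bits, which matches the 4 Kbyte span of one way.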

The FPU 70 includes a load/store stage with 4-deep load and store queues, a conversion stage (32-bit to
80-bit extended format), and an execution stage. Loads are controlled by the CPU core 20, and cacheable 
stores are directed through the write buffers 29 (i.e., a write buffer is allocated for each floating point store 
operation). 

Referring to Figure 1b, the microprocessor has seven-stage X and Y execution pipelines: instruction fetch
(IF), two instruction decode stages (ID1, ID2), two address calculation stages (AC1, AC2), execution (EX), and
write-back (WB). Note that the complex ID and AC pipe stages are superpipelined.

The IF stage provides a continuous code stream into the CPU core 20. The prefetcher 35 fetches 16 bytes
of instruction data into the prefetch buffer 30 from either the (primary) instruction line cache 65 or the
(secondary) unified cache 60. BPU 40 is accessed with the prefetch address, and supplies target addresses to
the prefetcher for predicted changes of flow, allowing the prefetcher to shift to a new code stream in one clock.

The decode stages ID1 and ID2 decode the variable length X86 instruction set. The instruction decoder
21 retrieves 16 bytes of instruction data from the prefetch buffer 30 each clock. In ID1, the length of two
instructions is decoded (one each for the X and Y execution pipes) to obtain the X and Y instruction
pointers; a corresponding X and Y bytes-used signal is sent back to the prefetch buffer (which then increments
for the next 16 byte transfer). Also in ID1, certain instruction types are determined, such as changes of flow,
and immediate and/or displacement operands are separated. The ID2 stage completes decoding the X and Y
instructions, generating entry points for the microROM and decoding addressing modes and register fields.

During the ID stages, the optimum pipe for executing an instruction is determined, and the instruction is
issued into that pipe. Pipe switching allows instructions to be switched from ID2X to AC1Y, and from ID2Y to
AC1X. For the exemplary embodiment, certain instructions are issued only into the X pipeline: change of flow
instructions, floating point instructions, and exclusive instructions. Exclusive instructions include: any
instruction that may fault in the EX pipe stage and certain types of instructions such as protected mode segment loads,
string instructions, special register access (control, debug, test), Multiply/Divide, Input/Output, Push All/Pop
All (PUSHA/POPA), and task switch. Exclusive instructions are able to use the resources of both pipes because
they are issued alone from the ID stage (i.e., they are not paired with any other instruction). Except for these
issue constraints, any instructions can be paired and issued into either the X or Y pipe.

The address calculation stages AC1 and AC2 calculate addresses for memory references and supply
memory operands. The AC1 stage calculates two 32 bit linear (three operand) addresses per clock (four operand
addresses, which are relatively infrequent, take two clocks). Data dependencies are also checked and resolved
using the register translation unit 25a (register scoreboard and register renaming hardware); the 32 physical
registers 24 are used to map the 8 general purpose programmer visible logical registers defined in the X86
architecture (EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP).

The AC unit includes eight architectural (logical) registers (representing the X86 defined register set) that 
35 are used by the AC unit to avoid the delay required to access in AC1 the register translation unit before ac- 
cessing register operands for address calculation. For instructions that require address calculations, AC1 waits 
until the required data in the architectural registers is valid (no read after write dependencies) before accessing 
those registers. During the AC2 stage, the register file 24 and the unified cache 60 are accessed with the phys- 
ical address (for cache hits, cache access time for the dual ported unified cache is the same as that of a register, 
effectively extending the register set) - the physical address is either the linear address or, if address
translation is enabled, a translated address generated by the ATU 50.

Translated addresses are generated by the ATU 50 from the linear address using information from page 
tables in memory and workspace control registers on chip. The unified cache is virtually indexed and physically 
tagged to permit, when address translation is enabled, set selection with the untranslated address (available 
at the end of AC1) and, for each set, tag comparison with the translated address from the ATU 50 (available
early in AC2). Checks for any segmentation and/or address translation violations are also performed in AC2. 

Instructions are kept in program order until it is determined that they will not cause an exception. For most 
instructions, this determination is made during or before AC2 - floating point instructions and certain exclusive 
instructions may cause exceptions during execution. Instructions are passed in order from AC2 to EX (or in 
the case of floating point instructions, to the FPU 70) - because integer instructions that may still cause an
exception in EX are designated exclusive, and therefore are issued alone into both execution pipes, handling 
exceptions in order is ensured. 

The execution stages EXX and EXY perform the operations defined by the instruction. Instructions spend 
a variable number of clocks in EX, i.e., they are allowed to execute out of order (out of order completion). Both 
EX stages include adder, logical, and shifter functional units, and in addition, the EXX stage contains
multiply/divide hardware.

The WB stage updates the register file 24, condition codes, and other parts of the machine state with the 
results of the previously executed instruction. The register file is written in Phase 1 (PH1) of WB and read in
Phase 2 (PH2) of AC2.

1.2 System 

Referring to Figure 2, for the exemplary embodiment, microprocessor 10 is used in a processor system
that includes a single chip memory and bus controller 82. The memory/bus controller 82 provides the interface
between the microprocessor and the external memory subsystem - level two cache 84 and main memory 86 
- controlling data movement over the 64 bit processor data bus (PD) (the data path is external to the controller 
which reduces its pin count and cost). 

Controller 82 interfaces directly to the 32-bit address bus PADDR, and includes a one bit wide data port
(not shown) for reading and writing registers within the controller. A bi-directional isolation buffer 88 provides
an address interface between microprocessor 10 and VL and ISA buses.

Controller 82 provides control for the VL and ISA bus interface. A VL/ISA interface chip 91 (such as an 
HT321) provides standard interfaces to a 32 bit VL bus and a 16 bit ISA bus. The ISA bus interfaces to BIOS
92, keyboard controller 93, and I/O chip 94, as well as standard ISA slots 95. The interface chip 91 interfaces
to the 32 bit VL bus through a bi-directional 32/16 multiplexer 96 formed by dual high/low word [31:16]/[15:0]
isolation buffers. The VL bus interfaces to standard VL slots 97, and through a bi-directional isolation buffer 
98 to the low double word [31:0] of the 64 bit processor data bus. 

2. Generalized pipeline architecture

Figure 3 illustrates an example of the performance of four instructions per pipeline, showing the overlap- 
ping execution of the instructions, for a two pipeline architecture. Additional pipelines and additional stages 
for each pipeline could also be provided. In the preferred embodiment, the internal operation of microprocessor
10 is synchronous with internal clock signal 122 at a frequency that is a multiple of that of external system
clock signal 124. In Figure 3, internal clock signal 122 is at twice the frequency of system clock signal 124. 
During first internal clock cycle 126, first stage instruction decode stages ID1 operate on respective instruc- 
tions X0 and Y0. During second internal clock cycle 128, instructions X0 and Y0 have proceeded to second 
stage instruction decode stages ID2, and new instructions X1 and Y1 are in first stage instruction decode units
ID1. During third internal clock cycle 130, instructions X2, Y2 are in first stage decode stages ID1, instructions
X1, Y1 are in second stage instruction decode stages ID2, and instructions X0, Y0 are in first address calculation
units AC1. During internal clock cycle 132, instructions X3, Y3 are in first stage instruction decode stages
ID1, instructions X2, Y2 are in second stage instruction decode stages ID2, instructions X1, Y1 are in the first 
address calculation stages AC1, and instructions X0 and Y0 are in second address calculation stages AC2. 
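
The stage occupancy described for Figure 3 can be tabulated with a short sketch. This is a minimal illustration in Python, not part of the specification; the function name `stage_occupancy` and the zero-based cycle numbering are assumptions made here, and only the ideal case in which every stage takes exactly one clock is modeled.

```python
# Pipe stages named in the text, in the order an instruction passes through them.
STAGES = ["ID1", "ID2", "AC1", "AC2", "EX", "WB"]


def stage_occupancy(cycle, n_instructions=4):
    """Return {stage: instruction index} for one pipe at a zero-based clock cycle.

    Instruction k enters ID1 on cycle k and advances one stage per clock
    (ideal flow, no stalls).
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        k = cycle - depth
        if 0 <= k < n_instructions:
            occupancy[stage] = k
    return occupancy


if __name__ == "__main__":
    for cycle in range(len(STAGES) + 3):
        row = stage_occupancy(cycle)
        print(cycle, {stage: f"X{k}/Y{k}" for stage, k in row.items()})
```

On the fourth cycle (index 3) the sketch places instruction 3 in ID1, 2 in ID2, 1 in AC1 and 0 in AC2, which corresponds to the state described for internal clock cycle 132.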

As is evident from this description, successive instructions continue to flow sequentially through the stages
of the X and Y pipelines. As shown in clock cycles 134, 140, the execution portion of each instruction is
performed on sequential clock cycles. This is a major advantage of a pipelined architecture, in that the number
of instructions completed per clock is increased, without reducing the execution time of an individual
instruction. Consequently a greater instruction throughput is achieved with greater demands on the speed of the
hardware.

The instruction flow shown in Figure 3 is the optimum case. As shown, no stage requires more than one 
clock cycle. In an actual machine though, one or more stages may require additional clock cycles to complete 
thereby changing the flow of instructions through the other pipe stages. Furthermore, the flow of instructions 
through one pipeline may be dependent upon the flow of instructions through the other pipeline. 

3. Write buffer architecture and operation 

As shown in Figure 1a, write buffer 29 is logically located at the output of core 20, and is operatively
connected to core 20 by writeback buses WB_x, WB_y to receive data therefrom. Write buffer 29 is also operatively
connected to ATU 50 to receive physical addresses therefrom via address buses PAx, PAy (Fig. 4). The output
of write buffer 29 is presented to unified cache 60 by way of dual cache port 160, and is also presented to
memory data bus DATA. Cache port 160 presents data, address and control lines to unified cache 60 in the
conventional manner; according to the preferred embodiment of the invention, the number of lines between
cache port 160 and unified cache 60 is sufficient to support two simultaneous write requests. 

As will be made further apparent hereinbelow, the function of write buffer 29 is to receive address and
data information from core 20 that are to be written to memory, rather than to one of the registers in register 
file 24; the address and data information stored in write buffer 29 can then be later written to memory at such 
time as the cache and memory subsystems are not otherwise busy in a higher priority operation. As a result,
write buffer 29 allows for core 20 to rapidly perform a memory write operation (from its viewpoint) and go on
to the next instruction in the pipeline, without disrupting memory read operations and without requiring wait 
states on the part of core 20 to accomplish the memory write. Further, the memory write operation performed 
by core 20 to write buffer 29 requires the same write cycle time, regardless of whether the memory location
is in unified cache 60 or in main memory 86.

Referring now to Figure 4, the detailed construction and operation of write buffer 29 according to the pre- 
ferred embodiment of the invention will now be described. It is to be understood that the example of write buffer 
29 described hereinbelow, while especially advantageous in the superpipelined superscalar architecture of
microprocessor 10, can also provide significant performance and other advantages when utilized in
microprocessors of different architecture.

According to the preferred embodiment of the invention, write buffer 29 contains twelve entries 152x0
through 152x5, 152y0 through 152y5, organized into two sections 152x, 152y. This split organization of write
buffer 29 in this example is preferred for purposes of layout and communication efficiency with the superscalar
architecture of microprocessor 10, with write buffer sections 152x, 152y associated with the X and Y pipelines,
respectively, of core 20. Alternatively, write buffer 29 could be organized as a single bank, with each entry
accessible by either of the X and Y pipelines of core 20. 

Write buffer 29 further includes write buffer control logic 150, which is combinatorial or sequential logic 
specifically designed to control write buffer 29 and its interface with core 20 in the manner described herein.
It is contemplated that one of ordinary skill in the art having reference to this specification will be readily able
to realize logic for performing these functions, and as such write buffer control logic 150 is shown in Figure 4
in block form. 

Referring now to Figure 5, the contents of a single entry 152xi in write buffer section 152x will now be
described; it is to be understood, of course, that each entry 152yi of write buffer section 152y will be similarly
constructed according to this preferred embodiment of the invention. Each entry 152xi contains an address
portion, a data portion, and a control portion. In addition, each entry 152 is identified by a four bit tag value
(not shown), as four bits are sufficient to uniquely identify each of the twelve entries 152 in write buffer 29.
The tag is used by core 20 to address a specific entry 152 so as to write data thereto (or source data therefrom)
during the EX stage and WB stage of the pipeline. By use of the four-bit tag, core 20 does not need to maintain 
the physical memory address of the write through the remainder of the pipeline. 

For the thirty-two bit integer architecture of microprocessor 10, each entry 152xi includes thirty-two bits
for the storage of a physical memory address (received from ATU 50 via physical address bus PAx), and thirty-two
bits for storage of a four-byte data word. Also according to this preferred embodiment of the invention,
each entry 152xi further includes twenty-three various control bits, defined as noted below in Table A. These
control bits are utilized by write buffer control logic 150 to control the allocation and issuing of entries 152. In
addition, other portions of microprocessor 10, such as control logic in unified cache 60, are also able to access
these control bits as necessary to perform their particular functions. The specific function of each control bit 
will be described in detail hereinbelow relative to the operation of write buffer 29.

Table A

AV    : address valid; the entry contains a valid address
DV    : data valid; the entry contains valid data
RD    : readable; the entry is the last write in the pipeline to its physical address
MRG   : mergeable; the entry is contiguous and non-overlapping to the preceding write buffer entry
NC    : noncacheable write
FP    : the entry corresponds to floating point data
MAW   : misaligned write
WBNOP : write buffer no-op
WAR   : write-after-read; the entry is a write occurring later in program order than a simultaneous read
        in the other pipeline
SPEC  : four bit field indicating the order of speculation for the entry
XDEP  : cross-dependency map of write buffer section 152y
SIZE  : size, in number of bytes, of data to be written
NCRA  : non-cacheable read has been previously allocated
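
As a cross-check of the control portion, the field widths can be tallied in a short sketch. This is an illustration in Python only. The four bit width of SPEC is taken from Table A, and the widths of XDEP (six bits) and SIZE (three bits) from the description elsewhere in this specification; treating every remaining field as a single bit is an assumption, as are the names `CONTROL_FIELDS`, `control_bit_count` and `tag_bits`.

```python
# Assumed widths, in bits, of the control fields of one write buffer entry.
CONTROL_FIELDS = {
    "AV": 1, "DV": 1, "RD": 1, "MRG": 1, "NC": 1,
    "FP": 1, "MAW": 1, "WBNOP": 1, "WAR": 1,  # assumed single-bit flags
    "SPEC": 4,  # four bit field (Table A)
    "XDEP": 6,  # one bit per entry in the opposite section
    "SIZE": 3,  # three write size control bits
    "NCRA": 1,  # assumed single-bit flag
}


def control_bit_count(fields=CONTROL_FIELDS):
    """Total number of control bits per entry."""
    return sum(fields.values())


def tag_bits(n_entries):
    """Smallest number of bits that uniquely identifies n_entries entries."""
    return max(1, (n_entries - 1).bit_length())
```

Under these assumptions the tally comes to twenty-three bits, and `tag_bits(12)` gives four, consistent with the four bit tag used to identify the twelve entries.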

Write buffer section 152x receives the results of either execution stage EXX of the X pipeline or execution 
stage EXY of the Y pipeline via writeback bus WB_x driven by core 20; similarly, write buffer section 152y
receives the results of either execution stage EXX of the X pipeline or execution stage EXY of the Y pipeline via
writeback bus WB_y.

Write buffer sections 152x, 152y present their contents (both address and data sections) to cache port
160, for example, via circuitry for properly formatting the data. As shown in Figure 4, write buffer section 152x
presents its data to barrel shifter 164x, which in turn presents its output to misaligned write latch 162x. As will
be described in further detail hereinbelow, misaligned write latch 162x allows for storage of the data from write
buffer section 152x for a second write to cache port 160, which is performed according to the present invention
in the event that a write to memory overlaps an eight-byte boundary. Misaligned write latch 162x presents its
output directly to cache port 160, and also to write gather latch 165; write gather latch 165, as will be described
in further detail hereinbelow, serves to gather data from multiple write buffer entries 152 for a single write to
cache port 160, in the event that the physical addresses of the multiple writes are in the same eight-byte group.

Write buffer section 152y presents its output to one input of multiplexer 163, which receives the output of
floating point data latch 166 at its other input; as will be described hereinbelow, floating point data latch 166
contains the output from the FPU 70, and provides sixty-four bit floating point data storage for a memory write
corresponding to one of write buffer entries 152. Multiplexer 163 is controlled by write buffer control logic 150
and by the cache control logic for unified cache 60, to select the appropriate input for presentation at its output,
as will be described hereinbelow. The output of multiplexer 163 is presented to shifter 164y, and in turn to
misaligned write latch 162y, in similar manner as is the output of write buffer section 152x described above. The
output of misaligned write latch 162y is also similarly connected directly to cache port 160 and also to write
gather latch 165.

While only a single cache port 160 is schematically illustrated in Figure 4 for simplicity of explanation, as 
described hereinabove, cache port 160 according to this embodiment of the invention is a dual cache port
enabling presentation of two write requests simultaneously. In addition, write buffer 29 also communicates data 
directly to data bus DATA. As such, according to this embodiment of the invention, the connections to cache 
port 160 shown in Figure 4 will be duplicated to provide the second simultaneous write to cache port 160, and
will also be provided directly to data bus DATA to effect a memory write in the event that cache control requires
a write to main memory 86. 

Also according to the preferred embodiment of the invention, write buffer 29 is capable of sourcing data 
directly from its entries 152 to core 20 by way of source buses SRCx, SRCy, under the control of write buffer
control logic 150 which controls multiplexers 154x, 154y. The output of multiplexer 154x may be applied to
either of the X or Y pipelines, under the control of pipeline control 28, via buses mem_x, mem_y to physical
registers 24; similarly, the output of multiplexer 154y may be applied to either of the X or Y pipelines via buses
mem_x, mem_y. In addition, writeback buses WB_x, WB_y are also connected to multiplexers 154x, 154y via
bypass buses BP_x, BP_y, respectively, so that memory bypassing of write buffer 29 is facilitated as will be
described hereinbelow.

As noted above, microprocessor 10 includes an on-chip FPU 70 for performing floating point operations. 
As noted above, the results of calculations performed by the FPU 70 are represented by sixty-four bit data 
words. According to this preferred embodiment of the invention, efficiency is obtained by limiting the data
portions of write buffer entries 152 to thirty-two bits, and by providing sixty-four bit floating point data latch 166
for receiving data from the FPU 70. Floating point data latch 166 further includes a floating point data valid
(FPDV) control bit which indicates, when set, that the contents of floating point data latch 166 contain valid
data. The address portion of one of write buffer entries 152 will contain the memory address to which the results
from the FPU 70, stored in floating point data latch 166, are to be written; this write buffer entry 152 will have
its FP control bit set, indicating that its data portion will not contain valid data, but that its corresponding data
will instead be present in floating point data latch 166.

Alternatively, of course, floating point data write buffering could be obtained by providing a sixty-four bit
data portion for each write buffer entry 152. According to this embodiment of the invention, however, pre-cache
write buffering of sixty-four bit floating point data is provided, but with significant layout and chip area
efficiency. This efficiency is obtained by not requiring each write buffer entry 152 to have a sixty-four bit data
portion; instead, floating point data latch 166 provides sixty-four bit capability for each entry 152 in write
buffer 29. It is contemplated that, for most applications, the frequency at which floating point data is provided
by the FPU 70 is on the same order as that at which the floating point data will be retired from floating point
data latch 166 (i.e., written to cache or to memory). This allows the single floating point data latch 166 shown
in Figure 4 to provide adequate buffering. Of course, in the alternative, multiple floating point data latches 166
could be provided in microprocessor 10 if additional buffering is desired.

The operation of write buffer 29 according to the preferred embodiment of the invention will now be de- 
scribed in detail. This operation is under the control of write buffer control logic 150, which is combinatorial or 
sequential logic arranged so as to perform the functions described hereinbelow. As noted above, it is
contemplated that one of ordinary skill in the art will be readily able to implement such logic to accomplish the
functionality of write buffer control logic 150 based on the following description.

Specifically, according to this embodiment of the invention, write buffer control logic 150 includes X and 
Y allocation pointers 156x, 156y, respectively, and X and Y retire pointers 158x, 158y, respectively; pointers 
156, 158 will keep track of the entries 152 in write buffer 29 next to be allocated or retired, respectively.
Accordingly, sections 152x, 152y of write buffer 29 each operate as a circular buffer for purposes of allocation
and retiring, and as a file of addressable registers for purposes of issuing data. Alternatively, write buffer 29
may be implemented as a fully associative primary data cache, if desired. 

In general, upon second address calculation stages AC2 determining that a memory write will be performed 
during the execution of an instruction, one of write buffer entries 152 will be "allocated" at such time as the 
physical address is calculated in this stage, such that the physical address is stored in the address portion of 

an entry 152 and its address valid control bit (AV) and other appropriate control bits are set. After execution
of the instruction, and during WBX, WBY (Fig. 1a), core 20 writes the result in the data portion of that write
buffer entry 152 to "issue" the write buffer entry, setting the data valid control bit (DV). The write buffer entry
152 is "retired" in an asynchronous manner, in program order, by interrogating the AV and DV bits of a selected
entry 152 and, if both are set, by causing the contents of the address and data portions of the entry 152 to
appear on the cache port 160 or the system bus, as the case may be.
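
The allocate, issue and retire sequence just described can be modeled for a single section as follows. This is a minimal behavioral sketch in Python, not the control logic itself: the class and method names are hypothetical, the section size of six follows from twelve entries in two sections, clearing AV and DV upon retirement is an assumption, and ordering between the X and Y sections is deliberately not modeled. The sketch stalls by returning `None` when the entry at the allocation pointer is still in use.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    address: int = 0
    data: int = 0
    av: bool = False  # address valid
    dv: bool = False  # data valid


class Section:
    """One write buffer section modeled as a circular buffer."""

    def __init__(self, size=6):
        self.entries = [Entry() for _ in range(size)]
        self.alloc_ptr = 0
        self.retire_ptr = 0

    def allocate(self, address):
        """AC2: store the physical address; return the tag, or None to stall."""
        entry = self.entries[self.alloc_ptr]
        if entry.av:
            return None
        entry.address, entry.av, entry.dv = address, True, False
        tag = self.alloc_ptr
        self.alloc_ptr = (self.alloc_ptr + 1) % len(self.entries)
        return tag

    def issue(self, tag, data):
        """WB: write the result into the entry addressed by tag and set DV."""
        entry = self.entries[tag]
        entry.data, entry.dv = data, True

    def retire(self):
        """Return (address, data) of the oldest entry if AV and DV are set."""
        entry = self.entries[self.retire_ptr]
        if not (entry.av and entry.dv):
            return None
        entry.av = entry.dv = False
        self.retire_ptr = (self.retire_ptr + 1) % len(self.entries)
        return entry.address, entry.data
```

Because the core addresses an entry by its tag when issuing, data may arrive out of order, while `retire` releases entries only in allocation order.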

3.1 Allocation of write buffer entries 

Referring now to Figure 6, the process for allocation of write buffer entries 152 according to the preferred
embodiment of the invention will now be described in detail. In this embodiment of the invention, the allocation
process is performed as part of the second address calculation stages AC2 in both the X and Y pipelines. As
shown by process 170 of Figure 6, the allocation process is initiated upon the calculation of a physical memory
address to which results of an instruction are to be written (i.e., a memory write).

For ease of explanation, the sequence of Figure 6 will be described relative to one of the sections 152x,
152y of write buffer 29. The allocation of write buffer entries 152 in the opposite section of write buffer 29 will
be identical to that shown in Figure 6. 

Once the physical address is calculated, process 172 retrieves control bit AV from the write buffer entry
152 to which the allocation pointer 156 is pointing. Each side of write buffer 29 according to this embodiment
of the invention operates as a circular buffer, with allocation pointers 156x, 156y indicating the next write buffer
entry 152 to be allocated for the X and Y pipelines, respectively; for purposes of this description, the write buffer
entry 152 to which the appropriate allocation pointer 156x, 156y points will be referred to as 152n. Decision
173 determines if control bit AV is set (1) or cleared (0). If control bit AV is already set, write buffer entry 152n
is already allocated or pending, as it has a valid address already stored therein. As such, entry 152n is not
available to be allocated at this time, causing wait state 174 to be entered, followed by repeated retrieval and
checking of control bit AV for the next entry 152n+1 in process 172 and decision 173.

If decision 173 determines that control bit AV for entry 152n is cleared, entry 152n is available for allocation
as it is not already allocated or pending. In this case, process 176 stores the physical address calculated in
process 170 into the address portion of entry 152n.

The specific order of processes 176 through 188 shown in Figure 6 is by way of example only. It is
contemplated that these processes may be performed in any order deemed advantageous or suitable for the
specific realization by one of ordinary skill in the art.

3.1.1 Read-after-multiple-write hazard handling

According to this embodiment of the invention, certain data dependencies are detected and handled
relative to write buffer accesses. As is well known in the art, data dependencies are one type of hazard in a
pipelined architecture microprocessor that can cause errors in the program result. These dependencies are even
more prevalent in the superscalar superpipelined architecture of microprocessor 10, particularly where certain
instructions may be executed out of program order for performance improvement. Specifically, as noted
hereinabove relative to Figure 4, and as will be described in further detail hereinbelow, write buffer 29 can source
data to core 20 via buses SRCx, SRCy prior to retiring of an entry if the data is needed for a later instruction
in the pipeline. Readable control bit (RD) in write buffer entries 152 assists the handling of a special type of
read-after-write (RAW) dependency, in which the pipeline contains a read of a physical memory address that
is to be performed after multiple writes to the same physical address, and prior to the retiring of the write buffer
entries 152 assigned to the address. According to the preferred embodiment of the invention, only write buffer
entries 152 having their control bit RD set can be used to source data to core 20 via buses SRCx, SRCy. This
avoids the possibility that incorrect data may be sourced to core 20 from a completed earlier write, instead of
from a later allocated but not yet executed write operation to the same physical address.

In process 178, write buffer control logic 150 examines the address fields of each previously allocated write
buffer entry 152 to determine if any match the physical address which is to be allocated to entry 152n. According
to the preferred embodiment of the invention, considering that the size of each read or write operation can be
as many as eight bytes (if floating point data is to be written; four bytes for integer data in this embodiment of
the invention) and that each physical address corresponds to a single byte, not only must the physical address
values be compared in process 178 but the memory span of each operation must be considered. Because of
this arrangement, write operations having different physical addresses may overlap the same byte, depending
upon the size of their operations.

Referring now to Figure 7, the method by which the physical addresses of different memory access
instructions are compared in process 178 according to the preferred embodiment of the invention will be
described in detail. To compare the write spans of two write operations, pipeline control logic 28 loads a first span
map SPAN0 with a bit map in which bits are set that correspond to the relative location of bytes to which the
write operation of the older write instruction will operate, and loads a second span map SPAN1 with a bit map
having set bits corresponding to the location of bytes to which the write operation of the newer write instruction
will operate. The absolute position of the set bits in the span maps is unimportant so long as the end bits of span
maps SPAN0, SPAN1 correspond to the same physical byte address. Figure 7 illustrates an example of span
maps SPAN0, SPAN1 for two exemplary write operations. Process 178 next performs a bit-by-bit logical AND
of span maps SPAN0 and SPAN1, producing map ANDSPAN, in which set bits indicate the location
of any bytes which will be written by both of the write operations. In the example of Figure 7, two of the bits
are set in map ANDSPAN, indicating that the two exemplary write operations both are writing to two bytes.
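
The span map comparison can be illustrated as follows. This is a sketch in Python under stated assumptions and not the circuit of Figure 7: the names `span_map` and `and_span` are hypothetical, the maps are aligned by taking the lower of the two physical addresses as bit 0, and the example addresses used with it are invented for illustration.

```python
MAX_BYTES = 8  # largest operation size given in the text (floating point data)


def span_map(address, size, base):
    """Bit i of the result is set if the operation touches byte address base + i."""
    return ((1 << size) - 1) << (address - base)


def and_span(addr0, size0, addr1, size1):
    """Return (ANDSPAN, overlap flag) for an older and a newer write."""
    base = min(addr0, addr1)
    if max(addr0, addr1) - base >= MAX_BYTES:
        return 0, False  # too far apart to share any byte
    span0 = span_map(addr0, size0, base)
    span1 = span_map(addr1, size1, base)
    result = span0 & span1      # bit-by-bit logical AND of the two maps
    return result, result != 0  # logical OR of the bits of ANDSPAN
```

For two four-byte writes whose addresses differ by two, for example `and_span(0x1000, 4, 0x1002, 4)`, two bits of the result are set, comparable to the two-byte overlap of the Figure 7 example.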

Process 178 then performs a logical OR of the bits in map ANDSPAN to determine if any bits are set therein.
Readable control bit RD for entry 152n will be set (regardless of whether any matching entries are found) and
control bit RD will be cleared for any previously allocated write buffer entry 152 that causes the result of the
logical OR of the bits in map ANDSPAN to be true. Accordingly, and as will be described hereinbelow, if a later
read of write buffer 29 is to be performed (i.e., sourcing of data from write buffer 29 prior to retiring), only
last-written write buffer entry 152n will have its control bit RD set and thus will be able to present its data to
core 20 via source bus SRCx, SRCy. Those write buffer entries 152 having valid data (control bit DV set) but
having their control bit RD clear are prevented by write buffer control logic 150 from sourcing their data to
buses SRCx, SRCy.
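
The effect of process 178 on the readable bits can be sketched as follows. This is a simplified model in Python with hypothetical names; byte-range overlap is tested arithmetically here rather than with span maps, which gives the same yes/no answer, and only the AV and RD bits are modeled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    address: int
    size: int
    av: bool = True   # entry is allocated
    rd: bool = False  # readable


def spans_overlap(a, b):
    """True if the two writes touch at least one common byte."""
    return a.address < b.address + b.size and b.address < a.address + a.size


def allocate_with_rd(previous, new_entry):
    """Clear RD in every overlapping earlier entry, then set RD in the new one."""
    for old in previous:
        if old.av and spans_overlap(old, new_entry):
            old.rd = False
    new_entry.rd = True
    return new_entry
```

After two overlapping writes have been allocated in sequence, only the later entry remains readable, so a subsequent read cannot be sourced from the stale earlier entry.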

3.1.2 Cross-dependency and retiring in program order 

As noted above, write buffer entries 152 must be retired (i.e., written to unified cache 60 or main memory
86) in program order. For those implementations of the present invention where only a single bank of write
buffer entries 152 is used, program order is readily maintained by way of a single retire pointer 158. However,
because of the superscalar architecture of microprocessor 10, and in order to obtain layout efficiency in the
realization of write buffer 29, as noted above this example of the invention splits write buffer entries 152 into
two groups, one for each of the X and Y pipelines, each having its own retire pointer 158x, 158y,
respectively. This preferred embodiment of the invention provides a technique for ensuring retirement in program
order between X section write buffer entries 152x and Y section write buffer entries 152y.

Referring now to Figure 8, a map of cross-dependency control bits XDEP for a selected write buffer entry
152xi, at the time of its allocation, is illustrated. As shown in Figure 8, each write buffer entry 152xi in the X
portion of write buffer 29 has six cross-dependency control bits XDEP0 through XDEP5, each bit corresponding
to one of the write buffer entries 152yi in the Y section 152y of write buffer 29; similarly (and not shown in
Figure 8), each write buffer entry 152yi will have six cross-dependency control bits YDEP0 through YDEP5,
one for each of the write buffer entries 152xi in the X section 152x of write buffer 29. As illustrated in Figure
8, the contents of each cross-dependency bit XDEP for write buffer entry 152xi corresponds to the state of
control bit AV for a corresponding write buffer entry 152yi in the Y section 152y of write buffer 29, at the time
of allocation.

Process 180 in the allocation process of Figure 6 loads cross-dependency control bits XDEP0 through
XDEP5 for write buffer entry 152n that is currently being allocated, with the state of the address valid control
bits AV for the six write buffer entries 152yi in the Y section 152y of write buffer 29 at the time of allocation.
As will be described in further detail hereinbelow, as each write buffer entry 152 is retired, its corresponding
cross-dependency control bit XDEP in each of the write buffer entries 152 in the opposite portion of write buffer
29 is cleared. Further, after a write buffer entry 152 has its cross-dependency control bits XDEP set in process
180 of the allocation sequence, no additional setting of any of its own cross-dependency control bits XDEP
can occur.

Program order is thus maintained by requiring that, in order to retire a write buffer entry 152, all six of its
cross-dependency control bits XDEP0 through XDEP5 must be cleared (i.e., equal to 0). Accordingly, the setting
of cross-dependency control bits XDEP in process 180 takes a "snapshot" of those write buffer entries 152 in
the opposite portion of write buffer 29 that are previously allocated (i.e., ahead of the allocated write buffer
entry 152n in the program sequence). The combination of the cross-dependency control bits XDEP and retire
pointers 158x, 158y ensures that write buffer entries 152 are retired in program order.
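
The snapshot mechanism can be illustrated with a small model. This is a sketch in Python with hypothetical class and method names; it models only the AV and XDEP bits, ignores the DV bit and the per-section retire pointers, and assumes that the snapshot is taken before the AV bit of the entry being allocated is set.

```python
class Entry:
    def __init__(self, per_section):
        self.av = False
        self.xdep = [False] * per_section  # one bit per opposite-section entry


class WriteBuffer:
    def __init__(self, per_section=6):
        self.sections = {
            side: [Entry(per_section) for _ in range(per_section)]
            for side in ("x", "y")
        }

    @staticmethod
    def _other(side):
        return "y" if side == "x" else "x"

    def allocate(self, side, index):
        """Snapshot the AV bits of the opposite section, then mark the entry allocated."""
        entry = self.sections[side][index]
        entry.xdep = [e.av for e in self.sections[self._other(side)]]
        entry.av = True

    def can_retire(self, side, index):
        """An entry may retire only when all of its cross-dependency bits are clear."""
        entry = self.sections[side][index]
        return entry.av and not any(entry.xdep)

    def retire(self, side, index):
        if not self.can_retire(side, index):
            return False
        self.sections[side][index].av = False
        for e in self.sections[self._other(side)]:
            e.xdep[index] = False  # release entries that were waiting on this one
        return True
```

If entries are allocated in the order X0, Y0, X1, the model permits retirement only in that same order: Y0 waits on X0, and X1 waits on Y0.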

In similar manner, as will be described in detail hereinbelow, microprocessor 10 may include provisions
for performing non-cacheable reads from main memory 86, which must be performed in program order. The
presence of a previously allocated non-cacheable read is indicated for each write entry by non-cacheable read
allocation control bit (NCRA) being set; upon execution of the non-cacheable read, control bit NCRA is cleared
for all write buffer entries 152. The setting and clearing of control bit NCRA is performed in the same manner
as cross-dependency control bits XDEP described hereinabove, to ensure that the non-cacheable read is
performed in the proper program order.

3.1.3 Completion of allocation process 

Process 182 is then performed in the allocation of write buffer entry 152n, in which certain control bits in
write buffer entry 152n are set according to the specific attributes of the memory write to be accomplished
thereto. Write size control bits (SIZE) are set with the number of bytes of data (up to eight bytes, thus requiring
three write size control bits SIZE) that are to be written to write buffer entry 152n, as indicated in the instruction.
Others of the control bits in write buffer entry 152n are also set in process 182 to control the operation of
microprocessor 10 in the use of write buffer entry 152n. While the specific control effected in this embodiment
of the invention based upon the state of these bits will be described in detail hereinbelow, the following is a
summary of the nature of these control bits. Non-cacheable write control bit (NC) is set if the memory write
operation is to be non-cacheable. Mergeable control bit (MRG) is set for write buffer entry 152n if the physical
memory locations corresponding thereto are contiguous and non-overlapping with the memory locations
corresponding to a previously allocated write buffer entry 152, such that a gathered write operation may be
performed. Write-after-read control bit (WAR) is set if the write operation to write buffer entry 152n is to be
performed after a simultaneous read in the other pipeline. Misaligned write control bit (MAW) is set if the length
of the data to be written to the physical address stored in write buffer entry 152n crosses an eight-byte boundary
(in which case two write cycles will be required to retire write buffer entry 152n). Control bit NCRA is set if a
non-cacheable read has previously been allocated and not yet performed.
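
The condition for setting control bit MAW can be expressed as a short check. The following Python sketch is illustrative only; the function name `is_misaligned` and its argument names are hypothetical, and the sketch assumes that eight-byte boundaries fall at physical addresses that are multiples of eight.

```python
def is_misaligned(physical_address: int, size: int) -> bool:
    """True if a write of `size` bytes at `physical_address` crosses an
    eight-byte boundary, so that two write cycles would be needed."""
    if not 1 <= size <= 8:
        raise ValueError("write size must be 1 to 8 bytes")
    offset = physical_address % 8   # byte offset within the eight-byte word
    return offset + size > 8
```

For example, a four-byte write at an address with offset 6 spills two bytes into the next eight-byte word, while a four-byte write at offset 4 does not.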

Once the storing of the physical address and the setting of the control bits in write buffer entry 152n is
complete, control bit AV for write buffer entry 152n is set in process 184. In addition, if not previously cleared
by a previous retire operation, control bit DV is cleared at this time. The setting of control bit AV indicates the
allocation of write buffer entry 152n to subsequent operations, including the setting of cross-dependency
control bits XDEP upon the allocation of a write buffer entry 152 in the opposite section of write buffer 29.

In process 186, write buffer control logic 150 returns the tag value of now-allocated write buffer entry 152n
to core 20. Core 20 then uses this four bit tag value in its execution of the instruction, rather than the full
thirty-two bit physical address value calculated in process 170. The use of the shorter tag value facilitates the
execution of the instruction, and thus improves the performance of microprocessor 10.

The allocation sequence is completed in process 188, in which allocation pointer 156x, 156y (depending
upon whether write buffer entry 152n is in the X or Y sections 152x, 152y of write buffer 29) is incremented to
point to the next write buffer entry 152 to be allocated. Control then passes to process 190, which is the
associated EX stage in the pipeline, if the instruction associated with the write is not prohibited from moving
forward in the pipeline for some other reason.
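
Processes 184 through 188 can be summarized in a short Python sketch. The class `Section`, the tag numbering (`base_tag` plus entry index) and the behavior when the section is full are assumptions made for illustration; the text above states only that a four bit tag is returned and that the allocation pointer is incremented.

```python
SECTION_SIZE = 6  # assumed entries per section (six XDEP bits, four bit tag)

class Section:
    def __init__(self, base_tag):
        self.base_tag = base_tag    # hypothetical tag numbering per section
        self.alloc_ptr = 0          # allocation pointer 156x or 156y
        self.entries = [{"addr": None, "av": False, "dv": False}
                        for _ in range(SECTION_SIZE)]

    def allocate(self, physical_address):
        """Set AV, clear DV, return the tag, and advance the pointer."""
        index = self.alloc_ptr
        entry = self.entries[index]
        if entry["av"]:
            return None             # assumed: no free entry, pipeline waits
        entry["addr"] = physical_address
        entry["av"] = True          # process 184
        entry["dv"] = False         # data not yet issued
        self.alloc_ptr = (index + 1) % SECTION_SIZE  # process 188, circular
        return self.base_tag + index  # process 186: short tag, not the address
```

The modulo increment reflects the circular organization of each section; the core would carry only the returned tag through the EX stage.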

3.2 Issuing of data to write buffer entries 

Referring now to Figure 9, the process of issuing data to write buffer entries 152 will be described in detail
relative to a selected write buffer entry 152i. As noted above, the issue of data to write buffer 29 is performed
by core 20 after completion of the EX stage of the instruction, and during one of the WB stages depending upon
whether operation is in the X or the Y pipeline.

The issue sequence begins with process 192, in which core 20 places the data to be written to write buffer
29 on the appropriate one of writeback buses WB_x, WB_y, depending upon which of the X or Y pipelines is
executing the instruction. Core 20 also communicates the tag of the destination write buffer entry 152 to
write buffer control logic 150. Write buffer control logic 150 then enables write buffer entry 152i, which is the
one of write buffer entries 152 associated with the presented tag value, to latch in the data presented on its
associated writeback bus WB_x, WB_y, in process 194. Once the storage or latching of the data in write buffer
entry 152i is complete, control bit DV is set in process 196, ending the issuing sequence.

Once write buffer entry 152i has both its control bit AV and also its control bit DV set, write buffer entry
152i is in its "pending" state, and may be retired. As noted above, the retiring of a write buffer entry 152 is
accomplished on an asynchronous basis, under the control of cache logic used to operate unified cache 60,
such that the writing of the contents of write buffer entries 152 to unified cache 60 or main memory 86 occurs
on an as-available basis, and does not interrupt or delay the performance of cache or main memory read
operations. Considering that memory reads are generally of higher priority than memory writes, due to the
dependence of the program being executed upon the retrieval of program or data from memory, write buffer 29
provides significant performance improvement over conventional techniques.
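
The issue sequence and the resulting pending state can be sketched as follows. This Python fragment is a simplified illustration; the dictionary keyed by tag and the function names `issue` and `is_pending` are hypothetical and stand in for write buffer control logic 150.

```python
def issue(entries, tag, writeback_data):
    """Latch the writeback data into the tagged entry, then set DV."""
    entry = entries[tag]
    entry["data"] = writeback_data   # process 194: latch from WB_x or WB_y
    entry["dv"] = True               # process 196

def is_pending(entry):
    """An entry is pending, and may be retired, when AV and DV are both set."""
    return entry["av"] and entry["dv"]

# One entry that has been allocated (AV set) but has not yet received data.
entries = {3: {"av": True, "dv": False, "data": None}}
```

Because the core addresses the entry by tag alone, the issue step needs no address comparison.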

3.3 Retiring of write buffer entries 

Referring now to Figure 10, the sequence by way of which write buffer entries 152 are retired under the
control of cache control logic contained within or provided in conjunction with unified cache 60 will now be
described in detail. Certain special or complex write operations will be described in specific detail hereinbelow.
As such, the retiring sequence of Figure 10 is a generalized sequence.

3.3.1 Retiring of integer write buffer data

As noted above, the retiring sequence of Figure 10 is performed under the control of cache control logic
contained within or in conjunction with unified cache 60, and is asynchronous relative to the operation of the
X and Y pipelines. It is important that write buffer entries 152 be retired in program order.
Accordingly, write buffer 29 operates as a circular buffer with the sequence determined by retire pointers 158x,
158y for the two portions of write buffer 29. Retire pointers 158x, 158y maintain the program order of write
buffer entries 152 in their corresponding sections 152x, 152y of write buffer 29, and cross-dependency control
bits XDEP maintain order of entries 152 between sections 152x, 152y, as will be noted from the following
description.

For ease of explanation, as in the case of the allocation sequence described hereinabove, the sequence
of Figure 10 will be described relative to one of the sections 152x, 152y of write buffer 29. The retiring sequence
for the opposite section 152x, 152y of write buffer 29 will be identical.

The retiring sequence begins with process 200, in which control bit FP, control bit DV and control bit AV
are retrieved from write buffer entry 152r, which is the one of write buffer entries 152 that retire pointer 158 is
indicating as the next entry 152 to be retired. In decision 201, control bit FP and control bit AV are tested to
determine if write buffer entry 152r is associated with floating point data latch 166 (and thus is buffering floating
point results from the FPU 70). If both control bit FP and control bit AV are set, write buffer entry 152r is
associated with floating point data and the data will be retired according to the process described in section 3.3.2
hereinbelow.

If control bit AV is set and floating point control bit FP is clear, write buffer entry 152r is directed to integer
data. Decision 202 is next performed, in which the cache control logic determines if control bit AV and control
bit DV are both set. If not (either of AV and DV being clear), entry 152r is not ready to be retired, and control
passes to process 200 for repetition of the retrieval and decision processes. If both are set, valid integer data
is present in the data portion of write buffer entry 152r and the entry may be retirable.

Decision 204 is then performed to determine if cross-dependency control bits XDEP are all clear for write
buffer entry 152r. As described hereinabove, cross-dependency control bits XDEP are a snapshot of the control
bits AV for the write buffer entries 152 in the opposite section of write buffer 29 beginning at allocation of
write buffer entry 152r, and updated upon the retirement of each write buffer entry 152. If all of the
cross-dependency control bits XDEP are clear for write buffer entry 152r (and retire pointer 158 is pointing to it),
write buffer entry 152r is next in program order to be retired, and control passes to process 208.

If cross-dependency control bits XDEP are not all clear, then additional write buffer entries 152 in the
opposite section of write buffer 29 must be retired before entry 152r may be retired, so that program order may
be maintained. Wait state 206 is effected, followed by repetition of decision 204, until the write buffer entries
152 in the opposite section that were allocated prior to the allocation of write buffer entry 152r are retired first.

As will be described in detail hereinbelow, microprocessor 10 may include provisions for performing
non-cacheable reads from main memory 86, which must be performed in program order. The presence of a
previously allocated non-cacheable read is indicated for each write entry by control bit NCRA being set; upon
execution of the non-cacheable read, control bit NCRA is cleared for all write buffer entries 152. If this feature is
implemented, decision 204 will also test the state of control bit NCRA, and prevent the retiring of write buffer
entry 152r until such time as both all cross-dependency control bits XDEP and also control bit NCRA are clear.

Process 208 is then performed, in which the data section of write buffer entry 152r is aligned with the
appropriate bit or byte position for presentation to cache port 160 or to the memory bus. This alignment is
necessary considering that the physical memory address corresponds to specific byte locations, but the data is
presented in up to sixty-four bit words (eight bytes). As such, alignment of data with the proper bit positions is
important to ensure proper memory write operations. In addition, special alignment operations such as required
for gathered writes and for misaligned writes are accomplished in process 208. Details of these alignment
features and sequences are described hereinbelow.

Process 210 then forwards the data portion of write buffer entry 152r to cache port 160, whether directly
or via the special write circuitry shown in Figure 4. Once this occurs, the one of cross-dependency control bits
XDEP corresponding to write buffer entry 152r is cleared in each write buffer entry 152 in the opposite section
of write buffer 29 (in process 212). This allows the next write buffer entry 152 in sequence (i.e., the write buffer
entry 152 pointed to by the opposite retire pointer 158) to be retired in the next operation. Process 214 clears
control bit AV and control bit DV for the write buffer entry 152r currently being retired. Process 216 then
increments retire pointer 158 for its section to enable the retirement of the next write buffer entry 152 in sequence,
and allow re-allocation of write buffer entry 152r. Control of the retiring sequence then passes back to process
200 for retrieval of the appropriate control bits.
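
The integer retiring sequence of Figure 10 can be approximated by the following Python sketch. It is a simplified model under stated assumptions: the section layout, the function `try_retire` and the list `memory_writes` (standing in for cache port 160) are hypothetical, the section size of six is inferred from the six XDEP bits, and the alignment of process 208, the floating point path and control bit NCRA are omitted.

```python
SECTION_SIZE = 6  # assumed entries per section

def make_section():
    return {"retire_ptr": 0,
            "entries": [{"av": False, "dv": False, "data": None,
                         "xdep": [False] * SECTION_SIZE}
                        for _ in range(SECTION_SIZE)]}

def try_retire(own, opposite, memory_writes):
    """One pass through decisions 202 and 204 and processes 210 to 216.

    Returns True if the entry at the retire pointer was retired."""
    index = own["retire_ptr"]
    entry = own["entries"][index]
    if not (entry["av"] and entry["dv"]):   # decision 202: not pending
        return False
    if any(entry["xdep"]):                  # decision 204: older opposite writes
        return False
    memory_writes.append(entry["data"])     # process 210
    for other in opposite["entries"]:       # process 212
        other["xdep"][index] = False
    entry["av"] = entry["dv"] = False       # process 214
    own["retire_ptr"] = (index + 1) % SECTION_SIZE  # process 216
    return True
```

Calling `try_retire` alternately on the two sections yields writes in program order, since a newer entry is held back until the older entry in the opposite section clears its XDEP bit.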

As noted above, while a single cache port 160 is schematically illustrated in Figure 4 and discussed relative
to process 210 hereinabove, cache port 160 serves as a dual cache port and write buffer 29 in microprocessor
10 of Figure 1a is also in communication directly with data bus DATA. Accordingly, in this case, the cache
control logic will select the appropriate port to which write buffer 29 presents data from entry 152r in process 210.

Furthermore, the provision of dual cache port 160 allows for additional streamlining in the case where two
sections of write buffer 29 are provided, as shown in Figure 4, as data may be presented from two write buffer
entries 152 (one in each of the X and Y sections 152x, 152y of write buffer 29) simultaneously via the dual
cache port 160. If such simultaneous presentation of data is provided, the cross-dependency decision 204 must
allow for one of the write buffer entries 152 to have a single set cross-dependency control bit XDEP, so long
as the simultaneously presented write buffer entry 152 corresponds to the set XDEP bit. The retiring process
may thus double its output rate by utilizing the two sections 152x, 152y of write buffer 29.
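
The relaxed form of decision 204 for simultaneous presentation may be sketched as below. The function `can_retire_pair` and its arguments are hypothetical; each entry is modeled only by its list of XDEP bits, indexed by entry number in the opposite section.

```python
def can_retire_pair(x_entry, x_index, y_entry, y_index):
    """True if the X and Y entries may be presented together.

    At most one of the two entries may have a set XDEP bit, and that single
    bit must be the one that names the partner entry."""
    x_set = [i for i, bit in enumerate(x_entry["xdep"]) if bit]
    y_set = [i for i, bit in enumerate(y_entry["xdep"]) if bit]
    if not x_set and not y_set:
        return True
    if not x_set and y_set == [x_index]:
        return True
    if not y_set and x_set == [y_index]:
        return True
    return False
```

The sketch rejects any pair in which an entry still depends on a third, unretired entry, which would otherwise break program order.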

3.3.2 Retire of floating point write buffer data 

If decision 201 determines that both control bit AV and control bit FP are set, write buffer entry 152r to
which retire pointer 158 points is associated with floating point results from the FPU 70. According to this
embodiment of the invention, control bit DV for entry 152r will also be set despite the absence of valid integer
data therein, for purposes of exception handling as will be described hereinbelow.

Decision 203 is then performed, by way of which the cache control logic interrogates control bit FPDV of
floating point data latch 166 to see if the FPU 70 has written data thereto, in which case control bit FPDV will
be set. Control bit FPDV is analogous to control bit DV of write buffer entries 152, as it indicates when set that
the FPU 70 has written valid data thereto. Conversely, if control bit FPDV is clear, the FPU 70 has not yet written
data to floating point data latch 166, in which case decision 204 will return control to process 200 in the retire
sequence of Figure 10.

If control bit FPDV is set, decision 205 is then performed by way of which cross-dependency control bits
XDEP of write buffer entry 152r are interrogated to see if all bits XDEP are cleared. If not, additional write buffer
entries 152 that were allocated in program order prior to entry 152r, and that reside in the opposite section of
write buffer 29 from entry 152r, must be retired prior to entry 152r being retired. Wait state 207 is then executed,
and decision 205 is repeated. Upon all cross-dependency control bits XDEP of entry 152r becoming clear,
decision 205 passes control to process 208, for alignment and presentation of the contents of floating point data
latch 166 to cache port 160. As noted above, if simultaneous presentation of two write buffer entries 152 is
allowed via dual cache port 160, one of the entries 152 may have a single set XDEP bit so long as it corresponds
to the simultaneously presented entry of the pair.

Cross-dependency control bits XDEP in opposite section entries 152 are then cleared (process 212),
control bit AV and control bit FPDV are cleared (process 214), and retire pointer 158 is incremented (process 216),
as in the case of integer data described hereinabove.

3.4 Ordering of non-cacheable reads

The cross-dependency scheme used in the allocation of write buffer entries 152 described hereinabove
may also be used for other functions in microprocessor 10. Similarly as for non-cacheable writes described
hereinbelow, microprocessor 10 may have instructions in its program sequence that require non-cacheable
reads from memory. By way of definition, a non-cacheable read is a read from main memory 86 that cannot
by definition be from unified cache 60; the non-cacheable read may, for purposes of this description, be
considered as a single entry read buffer that serves as a holding latch for requesting a read access to main memory
86. In order to ensure proper pipeline operation, non-cacheable reads must be executed in program order.
Accordingly, especially in the case of superpipelined superscalar architecture microprocessor 10 described
herein, a method for maintaining the program order of non-cacheable reads is necessary.

Referring now to Figure 17, non-cacheable read cross-dependency field 310 according to the preferred
embodiment of the invention is illustrated. Non-cacheable read cross-dependency field 310 is preferably
maintained in cache control logic of unified cache 60, and includes allocated control bit NCRV which indicates, when
set, that a non-cacheable read has been allocated. Similarly as cross-dependency control bits XDEP described
hereinabove, control bit NCRA in each write buffer entry 152 is set at the time of its
allocation, if allocated control bit NCRV is set, indicating that a non-cacheable read is previously allocated.
Control bit NCRA is tested during the retiring of each write entry 152 to ensure proper ordering of requests to
main memory 86.

In addition, non-cacheable read cross-dependency field 310 contains one bit position mapped to each of
the control bits AV of each write buffer entry 152, to indicate which of write buffer entries 152 are previously
allocated at the time of allocation of the non-cacheable read, and to indicate the retirement of these previously
allocated write buffer entries 152. Non-cacheable read cross-dependency field 310 operates in the same
manner as cross-dependency control bits XDEP, with bits set only upon allocation of the non-cacheable read, and
cleared upon retirement of each write buffer entry.

Referring now to Figures 18a and 18b, the processes of allocating and retiring a non-cacheable read
operation according to the preferred embodiment of the invention will now be described in detail. In Figure 18a,
the allocation of a non-cacheable read is illustrated by process 312 first determining that an instruction includes
a non-cacheable read. Process 314 is then performed by way of which a snapshot of the control bits AV is
loaded into non-cacheable read cross-dependency field 310. Process 316 is then performed, in which allocated
control bit NCRV in non-cacheable read cross-dependency field 310 is set, indicating to later-allocated write
buffer entries 152 that a non-cacheable read operation has already been allocated. Address calculation stage
AC2 then continues (process 318).

Figure 18b illustrates the performing of the non-cacheable read, under the control of the control logic of
unified cache 60. Decision 319 determines if non-cacheable read cross-dependency field 310 is fully clear. If
any bit in non-cacheable read cross-dependency field 310 is set, one or more of the write buffer entries 152
allocated previously to the non-cacheable read has not yet been retired; wait state 321 is then entered and
decision 319 repeated until all previously allocated write buffer entries have been retired.

Upon non-cacheable read cross-dependency field 310 being fully clear, the non-cacheable read is next
in program order to be performed. Process 320 is then executed to effect the read from main memory 86 in
the conventional manner. Upon completion of the read, allocated control bit NCRV in non-cacheable read
cross-dependency field 310 is cleared in process 322, so that subsequent allocations of write buffer entries
152 will not have their control bits NCRA set. Process 324 then clears control bits NCRA in each of write buffer
entries 152, indicating the completion of the non-cacheable read and allowing retiring of subsequent write
buffer entries 152 in program order.
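
The interplay of field 310, control bit NCRV and control bits NCRA can be modeled in a few lines of Python. The class below is an illustrative sketch only: the class and method names are hypothetical, the total of twelve entries is an assumption (two sections of six), and the cross-dependency bits XDEP between sections are left out so that only the ordering against the non-cacheable read is shown.

```python
class NonCacheableReadOrder:
    """Sketch of field 310 (bit positions plus NCRV) and per-entry NCRA bits."""

    def __init__(self, num_entries=12):     # assumed total entry count
        self.av = [False] * num_entries     # AV bit of each write buffer entry
        self.ncra = [False] * num_entries   # NCRA bit of each write buffer entry
        self.field = [False] * num_entries  # bit positions of field 310
        self.ncrv = False                   # allocated control bit NCRV

    def allocate_write(self, i):
        self.av[i] = True
        self.ncra[i] = self.ncrv            # set if a read is already allocated

    def allocate_read(self):                # processes 314 and 316
        self.field = list(self.av)          # snapshot of the AV bits
        self.ncrv = True

    def retire_write(self, i):
        assert not self.ncra[i], "must wait for the non-cacheable read"
        self.av[i] = False
        self.field[i] = False               # cleared upon retirement

    def read_ready(self):                   # decision 319
        return self.ncrv and not any(self.field)

    def perform_read(self):                 # processes 320 to 324
        assert self.read_ready()
        self.ncrv = False
        self.ncra = [False] * len(self.ncra)
```

A write allocated before the read holds the read back through field 310, while a write allocated after the read is itself held back through its NCRA bit until the read completes.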

Considering that control bits NCRA in write buffer entries 152, taken as a set, correspond to non-cacheable
read cross-dependency field 310, it is contemplated that the use of a single set of these indicators can suffice
to control the program order execution of the non-cacheable read. For example, if only non-cacheable read
cross-dependency field 310 is used, allocation and retiring of write buffer entries 152 would be controlled by
testing field 310 to determine if a non-cacheable read has been allocated, and by testing the corresponding
bit position in field 310 to determine if the particular write buffer entry 152 was allocated prior to or after the
non-cacheable read.

Therefore, according to this preferred embodiment of the invention, non-cacheable read operations can
be controlled to be performed in program order relative to the retiring of write buffer entries 152.

4. Read-after-write hazard detection and write buffer operation

As discussed above, certain hazards are inherent in pipelined architecture microprocessors, and
particularly in superpipelined superscalar microprocessors such as microprocessor 10. An important category of such
hazards is data dependencies, which may occur if multiple operations to the same register or memory location
are present in the pipeline at a given time.

A first type of data dependency is the RAW, read-after-write, data dependency, in which a write and a read
to the same memory location are present in the pipeline, with the read operation being a newer instruction
than the write. In such a case, the programmer has assumed that the write will be completed before the read
is executed. Due to pipeline operation, however, the memory access for the read operation may be performed
prior to the execution of the write, particularly if the read operation is implicit in another instruction such as
an add or multiply. In this event, the read will return incorrect data to the core, since the write to the memory
location has not yet been performed. This hazard is even more likely to occur in the superscalar superpipelined
architecture of microprocessor 10, and still more likely if instructions can be executed out of program order,
as described above.

Referring to Figure 11, the sequence of detecting and handling RAW hazards in microprocessor 10
according to the preferred embodiment of the invention will now be described in detail. In this example, RAW
hazard detection occurs as a result of physical address calculation process 218 performed in the second
address calculation stage AC2 of the X and Y pipelines for each read instruction. In decision 219, write buffer
control logic 150 compares the read physical address calculated in process 218 against each of the physical
address values in all write buffer entries 152, regardless of pipeline association. This comparison not only
compares the physical address of the read access to those of the previously allocated addresses, but also considers
the span of the operations, in the manner described hereinabove relative to process 178 in Figures 6 and 7.
This comparison is also performed relative to the instruction currently in the second address calculation stage
of the opposite X or Y pipeline. If there is no overlap of the read operation with any of the writes that are either
previously allocated, or simultaneously allocated but earlier in program order, no RAW hazard can exist for that
particular read operation, and execution continues in process 222. If decision 219 determines that there is a
match between the physical address calculated for the read operation and the physical address for one or more
write buffer entries 152w that is allocated for an older instruction and has its address valid control bit AV set,
or that is allocated for a simultaneously allocated write for an older instruction, a RAW hazard may exist and
the hazard handling sequence illustrated in Figure 11 continues.

As noted above, one of the control bits for each write buffer entry 152 is write-after-read control bit WAR.
This control bit indicates that the write operation for which a write buffer entry 152 is allocated is a
write-after-read, in that it is a write operation that is to occur after an older (in program order) read instruction that is in
the second address calculation stage AC2 of the opposite pipeline at the time of allocation. Control bit WAR
is set in the allocation sequence (process 182 of Figure 6) if this is the case. This prevents lockup of
microprocessor 10 if the newer write operation executes prior to the older read operation, as the older read operation
would, upon execution, consider itself a read-after-write operation that would wait until the write is cleared;
since the write operation is newer than the read and will wait for the read to clear, though, neither the read nor
the write would ever be performed. Through use of control bit WAR, microprocessor 10 can determine if an
apparent RAW hazard is in fact a WAR condition, in which case the write can be processed.

Accordingly, referring back to Figure 11, decision 221 determines if control bit WAR is set for each write
buffer entry 152w having a matching physical address with that of the read, as determined in decision 219.
For each entry 152w in which the WAR bit is set, no RAW conflict exists; accordingly, if none of the matching
entries 152w have a clear WAR bit, execution of the read continues in process 222. However, for each matching
write buffer entry 152w in which write control bit WAR is not set, a RAW hazard does exist and the hazard
handling sequence of Figure 11 will be performed for that entry 152w. Of course, other appropriate conditions may
also be checked in decision 221, such as the clear status of the write buffer no-op control bit (WBNOP), and
the status of other control bits and functions as may be implemented in the particular realization of the present
invention.
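
Decisions 219 and 221 amount to a span-overlap comparison followed by a filter on control bit WAR. The Python sketch below illustrates this under simplifying assumptions: the function names and the dictionary fields are hypothetical, only entries with control bit AV set are considered, and the comparison against a simultaneously allocated write in the opposite pipeline is omitted.

```python
def spans_overlap(addr_a, size_a, addr_b, size_b):
    """True if byte ranges [addr, addr + size) share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

def raw_entries(read_addr, read_size, entries):
    """Allocated entries that overlap the read and are not write-after-read."""
    return [e for e in entries
            if e["av"]
            and spans_overlap(read_addr, read_size, e["addr"], e["size"])
            and not e["war"]]
```

An empty result corresponds to the read continuing in process 222; any entry returned would enter the hazard handling sequence.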

Decision 223 is next performed in which the control bit AV, address valid, is tested for each RAW entry
152w. Decision 223 is primarily performed to determine if those RAW entries 152w causing wait states for the
read operation (described below) have been retired. If no remaining RAW entries 152w have their control bits
AV set, the RAW hazard has been cleared and the read operation can continue (process 222).

For each of the remaining matching RAW entries 152w, process 224 is next performed to determine if the
entry is bypassable, or if the write causing the hazard must be completed prior to continuing the read operation.
According to the preferred embodiment of the invention, techniques are available by way of which unified cache
60 and, in some cases, write buffer 29 need not be written with the data from the write prior to sourcing of the
data to the read operation in core 20.

Such bypassing is not available for all writes, however. In this example, the results of non-cacheable writes
(indicated by non-cacheable control bit, NC, being set in entry 152) must be sourced from main memory 86.
Secondly, as discussed hereinabove, a special case of RAW hazard is a read after multiple writes to the same
physical location. As shown in Figure 6, process 178 of the allocation sequence sets control bit, RD, of a write
buffer entry 152 and clears control bit RD of all previously allocated write buffer entries to the same physical
address. Conversely, those write buffer entries 152 that are not readable (i.e., their control bit RD is clear)
cannot be used to source data to core 20, as their data would be in error. Thirdly, data cannot be sourced from a
write operation if the subsequent read encompasses bytes not written in the write operation, as an access to
cache 60 or main memory 86 would still be required to complete the read.

In the RAW handling sequence of Figure 11, process 224 is performed on each matching write buffer entry
152w to determine if the control bit RD for entry 152w is set (indicating that entry 152w is the last entry 152
allocated to the physical address of the read), to determine if the control bit NC is clear (indicating that the
write is not non-cacheable), and also to determine if the physical address of the read is an "exact" match to
that of the write to write buffer entry 152w, in that the bytes to be read are a subset of the bytes to be written
to memory. An entry 152w for which all three conditions are met is said to be "bypassable", and control passes
to decision 225 described below. If no bypassable entry 152w exists, as one or more of the above conditions
(non-cacheable, non-readable, or non-exact physical address) are not met, wait state 229 is effected and
control passes back to decision 223; this condition will remain until all non-bypassable entries 152w are retired as
indicated by their control bits AV being clear, after which the read operation may continue (process 222).
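
The three conditions tested in process 224 can be written as a single predicate. In the following Python sketch the function name `is_bypassable` and the dictionary fields are hypothetical; the "exact" match is modeled as the byte range of the read lying wholly within the byte range of the write.

```python
def is_bypassable(entry, read_addr, read_size):
    """True if RD is set, NC is clear, and the bytes to be read are a
    subset of the bytes to be written by this entry."""
    write_start = entry["addr"]
    write_end = write_start + entry["size"]
    exact = write_start <= read_addr and read_addr + read_size <= write_end
    return entry["rd"] and not entry["nc"] and exact
```

A read that extends even one byte beyond the write fails the test, since memory would still have to be accessed for the remaining byte.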

In this embodiment of the invention, the method of bypassing applicable to each bypassable entry 152w
is determined in decision 225, in which control bit DV, data valid, is tested to determine if write buffer entry
152w is pending (i.e., contains valid data) but not yet retired. For each bypassable entry 152w that is pending,
process 230 is performed by write buffer control logic 150 to enable the sourcing of the contents of the data
portion of write buffer entry 152w directly to core 20 without first having been written to memory. Referring to
Figure 4, process 230 is effected by write buffer control logic 150 enabling write buffer entry 152w, at the time
of the read operation, to place its data on its source bus SRC (i.e., the one of buses SRCx, SRCy for the section
of write buffer 29 containing entry 152w) and by controlling the appropriate multiplexer 154 to apply source
bus SRC to the one of the X or Y pipelines of core 20 that is requesting the data. In this case, therefore, the
detection of a RAW hazard is handled by sourcing data from write buffer 29 to core 20, speeding up the time
of execution of the read operation.

For those bypassable write buffer entries 152w that are not yet pending, however, as indicated by decision
225 finding that control bit DV is not set, valid data is not present in entry 152w, and cannot be sourced to core
20 therefrom. Process 232 is performed for these entries 152w so that, at the time that the write by core 20 to
write buffer entry 152w occurs, the valid data on writeback bus WB_x or WB_y (also present on the
corresponding bypass bus BP_x, BP_y and applied to the appropriate one of multiplexers 154x, 154y) will be applied to
the requesting X or Y pipeline in core 20. In this way, the RAW hazard is handled by bypassing write buffer 29
with the valid data, further speeding the execution of the read operation, as the storing and retrieval of valid
data from cache 60, main memory 86, or even the write buffer entry 152w are not required prior to sourcing of
the data to core 20.
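
The overall outcome of decisions 223 and 225 and process 224 for one read can be summarized in a small decision function. This Python sketch is illustrative only: the function `plan_read`, the returned strings and the precomputed `bypassable` flag are hypothetical, and the sketch inspects only the first bypassable entry on the assumption that control bit RD leaves at most one readable entry per physical address.

```python
def plan_read(raw_matches):
    """Return the action for a read given its matching RAW entries."""
    live = [e for e in raw_matches if e["av"]]         # decision 223
    if not live:
        return "continue"                              # process 222
    bypassable = [e for e in live if e["bypassable"]]  # process 224
    if not bypassable:
        return "wait"                                  # wait state 229
    entry = bypassable[0]
    if entry["dv"]:                                    # decision 225
        return "source_from_write_buffer"              # process 230, bus SRC
    return "bypass_on_writeback"                       # process 232, bus BP
```

The two bypass outcomes differ only in where the data comes from: the entry itself when its data is already valid, or the writeback path at the moment the core issues the data.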

5. Speculative execution and exception handling

5.1 Speculative execution

As noted above, superpipelined superscalar microprocessor 10 according to the preferred embodiment of
the invention is capable of executing instructions in a speculative manner. The speculation arises from the
execution of one or more instructions after a conditional branch or jump statement, prior to determining the
state of the condition upon which the jump or branch is based. Without speculative execution, the
microprocessor would have to wait for the execution of the instruction that determines the state of the condition, prior
to execution of any subsequent instructions, resulting in a pipeline "stall" condition. In speculative execution,
microprocessor 10 speculates as to the state of the condition, and executes instructions based on this speculation.
The effect of pipeline stalls is reduced significantly, depending upon the number of speculative executions
undertaken and the rate at which the speculation is accurate.

Microprocessor 10 according to this embodiment of the invention includes circuitry for rapidly clearing the
effect of unsuccessful speculation, particularly in ensuring that the results of speculative writes are not retired
to memory and in removing the speculatively written data from write buffer 29. Referring now to Figures 12a
and 12b, a method for executing speculative writes and handling unsuccessful speculation will now be
described in detail. The flow diagrams of Figures 12a and 12b illustrate this method by way of example, rather than
in a generalized manner; it is contemplated that one of ordinary skill in the art having reference to the following
description of this example will be able to readily implement the method of Figures 12a and 12b in a
microprocessor realization.

The exemplary sequence of Figure 12a begins with process 240, in which core 20 selects a series of instructions
to be performed in a speculative manner, in that the series of instructions corresponds to one result
of a conditional branch where the condition is not yet known. The determination of which of the conditional
branches (i.e., whether or not to take the conditional branch or jump) to select may be made according to conventional
predictive branching schemes. In process 242, allocation of two write buffer entries 152a, 152b (the
speculative branch including two write operations to memory, in this example) is performed in the second address
calculation stage AC2 of the pipeline, as described hereinabove. However, because the write operations
to write buffer entries 152a, 152b are speculative, at least one of the speculation control bits (SPEC) is set during
the allocation of process 242, depending upon the order of speculation of the write.

In this embodiment of the invention, four orders of speculative execution are permitted. The order, or degree,
of speculation is indicated for each write buffer entry 152 by the four j, k, l, m SPEC control bits (SPEC
bits), with each bit position corresponding to whether the write buffer entry 152 is a speculative write for one
of the selected conditional branches. Figure 12a illustrates the condition of four write buffer entries 152a, 152b,
152c, 152d after the allocation of process 242. As shown in Figure 12a, write buffer entries 152a, 152b allocated
in process 242 have their j SPEC bit set. Because the allocation of process 242 is for first order speculation
(i.e., it is the first speculation made in this example), only the single j SPEC control bit is set for entries 152a,
152b. Write buffer entries 152c, 152d are not yet allocated, and as such their speculation control bits are clear.

After the allocation of process 242, initiation of the execution of the speculative instructions in the selected
conditional branch begins in process 244. The execution of these instructions will, if completed, effect the
writes to allocated write buffer entries 152a, 152b, such that their control bits DV become set. Because the
execution of these writes is speculative, however, the retire sequence described relative to Figure 10 should
also include (where speculative execution is incorporated) a gating decision preventing the retiring of a write
buffer entry 152 unless its control bits SPEC are all clear. This prevents the results of speculative execution
from reaching memory, where it is more difficult and time-consuming, if possible at all, to recover in the event
that the speculative prediction was incorrect (i.e., the other branch from that selected in process 240 should
have been taken).
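
The SPEC bookkeeping and the retire gating decision described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit of the preferred embodiment: the class name WriteBufferEntry, the masks SPEC_J through SPEC_M, and the functions allocate and may_retire are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical bit masks for the four SPEC control bits j, k, l, m.
SPEC_J, SPEC_K, SPEC_L, SPEC_M = 0b0001, 0b0010, 0b0100, 0b1000


@dataclass
class WriteBufferEntry:
    av: bool = False  # address valid: the entry has been allocated
    dv: bool = False  # data valid: the core has written its data
    spec: int = 0     # SPEC control bits; nonzero means still speculative


def allocate(entry: WriteBufferEntry, spec_mask: int) -> None:
    """Allocate an entry and record which predictions it depends on."""
    entry.av = True
    entry.dv = False
    entry.spec = spec_mask


def may_retire(entry: WriteBufferEntry) -> bool:
    """Gating decision: retire only if allocated, written, and non-speculative."""
    return entry.av and entry.dv and entry.spec == 0
```

In this model an entry allocated with SPEC_J stays blocked even after its data arrives, and becomes eligible to retire only once the j bit has been cleared.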

In the example of Figure 12a, second order speculation also occurs, such that one of the instructions in
the branch selected in process 240 included another conditional branch or jump, for which predictive branch
selection is again performed in process 246 to keep the pipeline from stalling. Second order speculation means
that in order for the execution of the instructions for the branch selected in process 246 to be successful, not
only must the selection in process 246 be correct but the selection in process 240 must also be correct. While
process 246 is shown in Figure 12a as occurring after the execution of the instructions in process 244, due to
the superpipelined architecture of microprocessor 10 described hereinabove, the predictive branching of process
246 will often occur prior to completion of the execution initiated in process 244. Following selection of
the branch in process 246, write buffer entry 152c is allocated in process 248 (again during the second address
calculation pipeline stage). In this allocation of process 248, since any write to write buffer entry 152c is of
second order speculation, both the j and k SPEC control bits are set. The state of control bits SPEC for write
buffer entries 152a, 152b, 152c, 152d after process 248 is shown in Figure 12a. Execution of the speculative
instructions in the branch selected in process 246 is then initiated in process 250.

In the example of Figure 12a, third order speculation is also undertaken, meaning that the sequence of
instructions in the branch selected in process 246 also includes another conditional branch or jump. Process
252 selects one of the branches according to predictive branch selection; however, in order for this third order
selection to be successful, all three of the selections of processes 240, 246 and 252 must be successful. Again,
as before, process 252 may make the selection of the branch prior to completion of the execution of the instructions
in process 250, considering the superpipelined architecture of microprocessor 10. In this example,
write buffer entry 152d is allocated in process 254, with the three j, k and l SPEC bits set in write buffer entry
152d. The state of the control bits SPEC for write buffer entries 152a through 152d after process 254 is illustrated
in Figure 12a. Process 256 then executes the instructions of the branch selected in process 252, including
a write operation to write buffer entry 152d.
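
The relationship between the order of speculation and the SPEC bits set at allocation can be sketched as follows. The mapping of j to bit 0 through m to bit 3, the helper names, and the extension of the pattern to fourth order (all four bits set) are assumptions made for illustration; the text above gives the pattern explicitly only for the first three orders.

```python
SPEC_BIT_NAMES = ("j", "k", "l", "m")  # assumed ordering: bit 0 is j, bit 3 is m


def spec_mask_for_order(order):
    """Return the SPEC mask for a write allocated at the given speculation order."""
    if not 1 <= order <= len(SPEC_BIT_NAMES):
        raise ValueError("only orders 1 through 4 are permitted")
    # Order n depends on all n outstanding predictions, so the n lowest bits are set.
    return (1 << order) - 1


def spec_bit_names(mask):
    """List the names of the SPEC bits that are set in a mask."""
    return [name for i, name in enumerate(SPEC_BIT_NAMES) if mask & (1 << i)]


# SPEC state of the four entries after process 254 in the example of Figure 12a.
entries = {
    "152a": spec_mask_for_order(1),
    "152b": spec_mask_for_order(1),
    "152c": spec_mask_for_order(2),
    "152d": spec_mask_for_order(3),
}
```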

Referring now to Figure 12b, an example of the handling of both successful and unsuccessful speculative
execution by write buffer 29 will now be described. As in the example of Figure 12a, the sequence of Figure
12b is by way of example only rather than for the general case, but it is contemplated that one of ordinary skill
in the art will be able to readily realize the method in a microprocessor architecture.

In process 260, core 20 detects that the first selection of process 240 was successful, such that the condition
necessary to cause the branch (or non-branch) to the instructions executed in process 244 was satisfied
in a prior instruction. Accordingly, the contents of the data portions of write buffer entries 152a, 152b allocated
in process 242 and written in process 244 may be retired to memory, as their contents are accurate results of
the program being executed. In process 262, therefore, the j SPEC bits of all speculative write buffer entries
152a, 152b, 152c, 152d are cleared; the state of control bits SPEC for write buffer entries 152a through 152d
after process 262 is illustrated in Figure 12b. Since write buffer entries 152a, 152b now have all of their speculation
control bits SPEC clear (and since their data valid control bits DV were previously set), write buffer entries
152a, 152b may be retired to unified cache 60 or main memory 86, as the case may be.

In the example of Figure 12b, the second branch selection (made in process 246) is detected to be unsuccessful,
as the condition necessary for the instructions executed in process 250 was not satisfied by the prior
instruction. Furthermore, since the selection of the branch made in process 252 also depended upon the successful
selection of process 246, the condition necessary for the instructions to be executed in process 256
also will not be satisfied. To the extent that the writes to write buffer entries 152c, 152d have not yet been performed,
these writes will never be performed, because of the unsuccessful predictive selection noted above;
to the extent that these writes occurred (i.e., write buffer entries 152c, 152d are pending), the data should not
be written to memory as it is in error. Accordingly, write buffer entries 152c, 152d must be cleared for additional
use, without retiring of their contents.

The sequence of Figure 12b handles the unsuccessful speculative execution beginning with process 266,
in which those write buffer entries 152 having their k SPEC bit set are identified by write buffer control logic
150. In this example, these identified write buffer entries 152 are entries 152c (second order speculation) and
152d (third order speculation). In process 268, write buffer control logic 150 clears the address valid control
bits AV for each of entries 152c, 152d, such that entries 152c, 152d may be reallocated and will not be retired
(see the retire sequence of Figure 10, in which the AV bit must be set for retiring to take place).

As described hereinabove, retire pointers 158x, 158y point to the ones of write buffer entries 152 next to
be retired. According to the preferred embodiment of the invention, control bits WBNOP are set for write buffer
entries 152c, 152d, such that when the associated retire pointer 158 points to entries 152c, 152d, these entries
will be skipped (as though they were never allocated). This allows for retire pointers 158 to "catch up" to allocation
pointers 156 if their section of write buffer 29 is empty. Repeated checking of the address valid control
bits AV in the retire process can then safely stop, once the empty condition has been met.

Execution of the proper conditional branch can resume in process 270 shown in Figure 12b.
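
The resolution steps of Figure 12b can be sketched in the same illustrative style. The names Entry, resolve_success, resolve_failure and skip_noops are hypothetical. The sketch also assumes a single circular section with one retire pointer and one allocation pointer, and assumes that the WBNOP marker is cleared when an entry is skipped; the description above does not state when WBNOP is cleared.

```python
from dataclasses import dataclass

SPEC_J, SPEC_K, SPEC_L, SPEC_M = 0b0001, 0b0010, 0b0100, 0b1000


@dataclass
class Entry:
    av: bool = False     # address valid
    dv: bool = False     # data valid
    spec: int = 0        # SPEC control bits
    wbnop: bool = False  # no-op marker: the retire pointer may skip this entry


def resolve_success(entries, bit):
    """Prediction confirmed: clear that SPEC bit in every entry (process 262)."""
    for e in entries:
        e.spec &= ~bit


def resolve_failure(entries, bit):
    """Prediction wrong: flush allocated entries that depend on it (processes 266, 268)."""
    for e in entries:
        if e.av and e.spec & bit:
            e.av = False
            e.wbnop = True


def skip_noops(entries, retire_ptr, alloc_ptr):
    """Advance the retire pointer past flushed entries, up to the allocation pointer."""
    while retire_ptr != alloc_ptr and entries[retire_ptr].wbnop:
        entries[retire_ptr].wbnop = False  # assumption: marker is consumed when skipped
        retire_ptr = (retire_ptr + 1) % len(entries)
    return retire_ptr


# Entries 152a through 152d in the state reached after process 254 of Figure 12a.
buf = [
    Entry(av=True, dv=True, spec=SPEC_J),
    Entry(av=True, dv=True, spec=SPEC_J),
    Entry(av=True, dv=True, spec=SPEC_J | SPEC_K),
    Entry(av=True, dv=False, spec=SPEC_J | SPEC_K | SPEC_L),
]
resolve_success(buf, SPEC_J)  # first selection (process 240) was correct
resolve_failure(buf, SPEC_K)  # second selection (process 246) was wrong
```

After these two calls, the first two entries are free of speculation and may retire, while the last two have AV clear and WBNOP set, so a retire pointer reaching them steps over both.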

5.2 Exception handling

In addition to speculative execution, pipeline stalls and bubbles may occur in the event that execution of
an instruction returns an error condition, commonly referred to as an exception. An example of an exception
is where core 20 detects a divide-by-zero condition. When such an exception is detected in the execution stage
of the pipeline, the instructions still in the pipeline must be cleared in order for the exception condition to be
properly handled in the conventional manner. Specifically relative to write buffer 29, those write buffer entries
152 which were allocated after the instruction resulting in an exception must be flushed. Since the writes to
these entries 152 will never occur (and data valid control bit DV would never be set) because of the removal
of the write instructions from the pipeline, entries 152 would never retire from write buffer 29 if not otherwise
flushed; microprocessor 10 would then hang indefinitely, waiting for data that would never arrive.

Referring now to Figure 13, an example of a sequence for handling exceptions relative to write buffer 29
will now be described in detail. In process 272, core 20 detects an exception condition. Process 274 is then
performed by write buffer control logic 150, in which the control bit AV and control bit DV are retrieved from
each write buffer entry 152 in write buffer 29. Decision 273 then determines if any of the control bits AV are
set in write buffer 29. For each write buffer entry 152 that has its control bit AV set, decision 275 tests its control
bit DV (data valid) to determine if it is set. If not (meaning that the write to that entry 152 had not yet occurred at
the time of the exception), control bit AV is cleared and write buffer no-op bit WBNOP is set for that entry 152.
As described hereinabove, the WBNOP bit indicates that retire pointers 158 can skip this entry 152, such that
the empty condition where allocation pointers 156x, 156y equal their respective retire pointers 158x, 158y can
be achieved. Control is then returned to process 274 as will be described hereinbelow.

For those pending write buffer entries having both their control bits AV and control bits DV set (as determined
by decisions 273, 275), data was written by core 20 prior to the exception condition. As such, data written
to these locations is valid, and can be written to memory in the normal asynchronous retiring sequence as described
hereinabove relative to Figure 10. However, prior to the processing of the exception by microprocessor
10, all entries of write buffer 29 must be retired and available for allocation (i.e., write buffer 29 must be empty).
Control of the sequence thus returns to process 274, where the control bits AV and control bits DV are again
retrieved and interrogated, until such time as the control bits AV for all write buffer entries 152 are clear. Both
allocation pointers 156x, 156y will point to the same entry 152 as their respective retire pointers 158x, 158y
when all control bits AV are clear, considering the effect of the WBNOP bits. Once this empty condition is achieved,
process 278 can be initiated in which the exception condition is processed in the usual manner.
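
The exception sequence of Figure 13 amounts to a loop that flushes entries whose data never arrived and waits for the written entries to drain. The following sketch models that loop; the function names are hypothetical, and retire_one is only a stand-in for the asynchronous retire sequence of Figure 10, which in the actual design proceeds independently of this polling.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    av: bool = False     # address valid
    dv: bool = False     # data valid
    wbnop: bool = False  # no-op marker for the retire pointer


def flush_unwritten(entries):
    """Decisions 273, 275: drop allocated entries whose write never occurred."""
    for e in entries:
        if e.av and not e.dv:
            e.av = False
            e.wbnop = True


def retire_one(entries):
    """Stand-in for the retire sequence: retire one written entry, if any."""
    for e in entries:
        if e.av and e.dv:
            e.av = e.dv = False
            return True
    return False


def drain_for_exception(entries):
    """Poll AV and DV (process 274) until every AV bit is clear; return the pass count."""
    passes = 0
    while True:
        passes += 1
        flush_unwritten(entries)
        if not any(e.av for e in entries):
            return passes  # write buffer empty: process 278 may begin
        retire_one(entries)
```

The loop terminates because, after each flush, every entry that is still allocated has its data valid and is therefore eventually retired.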

6. Special write cycles from the write buffer 

As noted above relative to Figure 10, the retiring process may include special write operations from write
buffer 29 to cache port 160 or directly to data bus DATA. According to the preferred embodiment of the invention,
these special write cycles can include the handling of misaligned writes, and also write gathering. Sequences
for handling these special write cycles according to the preferred embodiment of the invention will
now be described in detail.

6.1 Misaligned writes 

As noted above, physical memory addresses presented within microprocessor 10 correspond to byte addresses
in memory, while data bus DATA is capable of communicating sixty-four bits in parallel (primarily from
data input/output in bus interface unit BIU to unified cache 60 in this embodiment of the invention). Because
the physical address in microprocessors of X86 compatibility type need not be a multiple of the operand size,
a significant fraction of memory writes may overlap eight-byte boundaries; these writes are referred to as
"misaligned" writes. Write buffer 29 in microprocessor 10 according to the preferred embodiment of the invention
accounts for such misaligned writes by indicating that a write buffer entry 152 is misaligned at the time of
allocation, allocating a second write buffer entry 152 which presents the second portion of the write, and by
initiating a special routine in the retiring process to account for the misaligned write. These sequences will now
be described in detail relative to Figures 14 and 15.

Figure 14 is a flow diagram of a portion of process 182 of the allocation sequence of Figure 6, for detecting
misaligned writes and indicating the same for the write buffer entry 152 being allocated. In process 280 of Figure
14, write buffer control logic 150 adds the physical address (lowest byte address) of the write operation to
write buffer entry 152n being allocated with the size (in bytes) of the write operation. Information regarding the
size of the write operation is contained within the instruction, as is typical for X86 type microprocessor instructions.
In decision 281, write buffer control logic 150 determines if the addition of process 280 caused a carry into
bit 3, indicating that the eight-byte boundary will be crossed by the write operation to the write buffer entry
152n being allocated. If decision 281 determines that no carry occurred, then the write to entry 152n will not
be misaligned; process 282 is then performed in which misaligned write control bit, MAW, is cleared in entry
152n, and the allocation sequence continues (process 288).

If a carry occurred, however, the write to entry 152n will cross the eight-byte boundary, in which case process
284 is performed to set control bit MAW in entry 152n. The next write buffer entry 152n+1 to be allocated
is then allocated for purposes of the misaligned write, in process 286, by loading the address portion of entry
152n+1 with the physical start address for the write to the next eight-byte group (i.e., the eight-byte address
after the detected carry in decision 281), and setting the control bit AV for entry 152n+1. A new physical address
calculation (pipeline stage AC2) is required in process 286, considering that the high physical address may
reside on a different physical page. The data portion of entry 152n+1 will remain empty, however, as entry 152n+1
will merely be used in the retiring process to effect the second operand write to memory. The remainder of the
allocation process then continues (process 288).
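
The boundary test of Figure 14 can be sketched as follows. The text describes the test as a carry into bit 3 when the address and size are added; this sketch instead compares the byte offset within the eight-byte group plus the size against eight, so that a write ending exactly on the boundary is treated as aligned. That boundary handling, the function names, and the restriction to sizes of one to eight bytes are assumptions of the sketch. The second address is computed arithmetically here, whereas the text notes that a new physical address calculation is needed because the next group may lie on a different physical page.

```python
GROUP_BYTES = 8  # data bus DATA carries sixty-four bits, i.e. eight bytes


def is_misaligned(addr, size):
    """True if a write of `size` bytes starting at `addr` crosses an eight-byte boundary."""
    offset = addr % GROUP_BYTES  # low three address bits
    return offset + size > GROUP_BYTES


def second_entry_address(addr):
    """Start address of the next eight-byte group, loaded into the second entry."""
    return (addr // GROUP_BYTES + 1) * GROUP_BYTES
```

For example, a four-byte write at address 0x1006 covers bytes 0x1006 through 0x1009 and is flagged, with 0x1008 as the address for the second entry; a four-byte write at 0x1004 ends at 0x1007 and is not flagged.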

Regardless of whether the write buffer entry 152n is a misaligned write, issuing of data to entry 152n occurs
in the manner described hereinabove relative to Figure 9. No special loading of the data portion of write buffer
entry 152n is effected according to this embodiment of the invention; in the case of a misaligned write, however,
no issuing of data to entry 152n+1 will occur.

Referring now to Figure 15, a sequence for handling the misaligned write in the retiring of a write buffer
entry 152 will now be described. As in the previously described retiring sequences, the sequence of Figure 15
is preferably performed under the control of the cache control logic with assistance from write buffer control
logic 150. The sequence of Figure 15 is performed as part of processes 208 and 210 of Figure 10 described
hereinabove. This sequence begins with decision 289, in which the control bit MAW of entry 152n is tested; if
clear, the retiring sequence continues (process 290 of Figure 15) in the manner described above. However, if
control bit MAW is set for entry 152n, process 292 is next performed in which the data portion of entry 152n
is latched in the appropriate misaligned data latch 162x, 162y.

The presentation of data from entry 152n must be done in two memory accesses, considering the misaligned
nature of the write. However, in splitting the write operation into two cycles, the data as stored in entry
152n is not in the proper "byte lanes" for presentation to cache port 160. Referring back to Figure 4, shifter
164 is a conventional barrel shifter for shifting the data presented from the corresponding write buffer section
152x, 152y prior to its storage in its misaligned write latch 162x, 162y. Shifter 164 thus is able to effect a single
shift of the data in the corresponding write buffer entry 152n, such that the lower order data will appear in
the higher order bit lanes (for presentation to cache port 160 in the first, lower order address, write operation),
and so that the higher order data will appear in the lower order bit lanes (for presentation to cache port 160 in
the second, higher order address, write operation). This shifting is effected in process 292 of the sequence
illustrated in Figure 15.

Process 294 is next performed, by way of which the physical address of entry 152n is presented to cache
port 160 along with the portion of the data corresponding to the lower address eight-byte group, aligned (by
shifter 164 in process 292) to the byte lanes corresponding to the lower address eight-byte group. This effects
the first write operation required for the misaligned write. Process 296 then presents the address and data for
the second operand of the misaligned write. The physical address is that stored in the address portion of the
next write buffer entry 152n+1, and the data is that retained in misaligned write latch 162 from entry 152n, shifted
by shifter 164 to the proper byte lanes for the second access to port 160. The remainder of the retiring process
then continues (process 298).
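
The two-cycle retire of a misaligned write can be modeled as a byte rotation followed by two lane-masked writes. This is a sketch under several assumptions that the text does not state: the data portion of the entry is taken to hold the operand right-justified in little-endian order, the single shift is modeled as a rotate by the byte offset, and the per-lane byte enables, the function names and the dictionary used as memory are all hypothetical.

```python
MASK64 = (1 << 64) - 1


def split_misaligned(addr, size, data):
    """Return (base, lanes, byte_enables) for the first and second write cycles."""
    offset = addr & 0x7
    if offset + size <= 8:
        raise ValueError("write does not cross an eight-byte boundary")
    first_len = 8 - offset                 # bytes that fit in the lower group
    data &= (1 << (8 * size)) - 1          # keep only the operand bytes
    shift = 8 * offset
    # Rotate left by `offset` bytes: low-order data moves to the high lanes,
    # high-order data wraps around into the low lanes.
    latch = ((data << shift) | (data >> (64 - shift))) & MASK64
    base = addr & ~0x7
    first = (base, latch, (0xFF << offset) & 0xFF)
    second = (base + 8, latch, (1 << (size - first_len)) - 1)
    return first, second


def apply_write(memory, base, lanes, enables):
    """Write the enabled byte lanes of a 64-bit value into a byte-addressed dict."""
    for lane in range(8):
        if (enables >> lane) & 1:
            memory[base + lane] = (lanes >> (8 * lane)) & 0xFF
```

For a four-byte write of 0xDDCCBBAA at address 0x1006, the first cycle writes bytes AA and BB into lanes 6 and 7 of the group at 0x1000, and the second cycle writes CC and DD into lanes 0 and 1 of the group at 0x1008.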

As noted above, the exception handling ability of microprocessor 10 according to this embodiment of the
invention uses the state of the control bit DV to determine whether an entry 152 either is or is not flushed after
detection of an exception. However, in the case of a misaligned write, the second write entry 152n+1 does not
have its control bit DV set even if the write has been effected, since the valid data is contained within the preceding
(in program order) write buffer entry 152n. Accordingly, if both misaligned write handling capability and
exception handling as described herein are provided, the exception handling sequence must also test both control
bit MAW and control bit DV for an entry 152n and, if both are set, must then consider the next write buffer
entry 152n+1 (in program order) to also have its control bit DV set, such that entry 152n+1 is not flushed.
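
The interaction between the exception flush and misaligned writes can be sketched as a flush that treats the entry following a written misaligned entry as if its data were valid. The names are hypothetical, and the sketch assumes that the next entry in program order is simply the next index in the same circular section.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    av: bool = False     # address valid
    dv: bool = False     # data valid
    maw: bool = False    # misaligned write: next entry holds the second address
    wbnop: bool = False  # no-op marker for the retire pointer


def flush_on_exception(entries):
    """Flush unwritten entries, sparing the second entry of a written misaligned write."""
    count = len(entries)
    written = [e.dv for e in entries]
    for i, e in enumerate(entries):
        if e.av and e.dv and e.maw:
            # The data for entry n+1 lives in entry n, so consider it written too.
            written[(i + 1) % count] = True
    for e, keep in zip(entries, written):
        if e.av and not keep:
            e.av = False
            e.wbnop = True
```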

As a result of this construction, misaligned writes are handled by microprocessor 10 according to the pres- 
ent invention in a way which does not impact core 20 operation, but only includes an additional latching and 
aligning step during the asynchronously performed, and non-critical, retiring sequence. 

6.2 Gathered writes 

Another type of special write operation performable by microprocessor 10 according to this embodiment
of the invention is the gathered write, where the data contained within successive write operations may be gathered
into a single write access to memory. As noted above, each physical address corresponds to a byte location.
If a series of writes are to be performed to one or a few bytes within the same block of bytes that may
be placed on the data bus simultaneously, microprocessor 10 is able to retain the data in the appropriate byte
lane so that a single write access to cache port 160 or to memory may be performed instead of successive
smaller write accesses. For example, since memory data bus DATA in microprocessor 10 is sixty-four bits wide,
eight bytes of data may be simultaneously written; according to the gathered write feature of the present invention,
these eight bytes may be gathered from multiple write buffer entries 152 in the manner described hereinbelow.

As described hereinabove relative to the allocation sequence for write buffer 29, mergeable control bit,
MRG, is set at the time of allocation for each write buffer entry 152 that is performing a write to a contiguous
non-overlapping physical memory address with that of another write buffer entry 152 previously allocated for
the immediately preceding memory write instruction in program order. The contiguousness and adjacency constraints
are implemented according to this preferred embodiment of the invention in consideration of the X86-compatibility
of microprocessor 10; it is contemplated, however, that write gathering may be implemented in
other architectures in such a way that membership of the data in the same block of bytes is the only necessary
constraint for mergeable writes. After allocation, issuing of data to the mergeable write buffer entries 152 continues
in the normal manner described hereinabove.

Referring now to Figure 16, the gathered write operation according to the preferred embodiment of the
invention will now be described in detail. Decision 299 determines whether the control bit MRG for the current
write buffer entry 152n being retired is set; if not, the normal retiring sequence continues (process 300). If control
bit MRG is set for the current entry 152n, process 302 is performed by way of which the data portion of
entry 152n is shifted by the appropriate shifter 164x, 164y, to the appropriate byte lanes to accommodate the
gathered write. Process 304 is then performed, in which the shifted data is stored in write gather latch 165 in
the proper byte lane position without disturbing data already loaded in write gather latch 165 from preceding
contiguous non-overlapping writes.

Decision 305 then interrogates the next write buffer entry 152n+1 to determine if its control bit MRG is set.
If so, control returns to process 302 where the data for this next entry 152n+1 is shifted and latched into write
gather latch 165 in process 304. Once no more mergeable entries 152 exist, as indicated by either the control
bit MRG or the control bit AV being clear for the next entry 152 (in decision 305), the contents of latch 165 are
presented to port 160, along with the appropriate physical address to accomplish the gathered write operation
to cache 60 or main memory 86, as the case may be. The retiring process then continues as before (process
308).
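
The gathering loop of Figure 16 can be sketched as follows. The sketch assumes that MRG is set on every entry of a mergeable run (including the first), that all entries of the run fall in the same eight-byte group, and that the entries are held in a simple list rather than a circular section; the names Entry and gather_write and the byte-enable mask are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    addr: int          # physical byte address of the write
    size: int          # operand size in bytes
    data: int          # operand, right-justified
    av: bool = True    # address valid
    mrg: bool = False  # mergeable with neighbouring entries


def gather_write(entries, index):
    """Merge a run of mergeable entries; return (base, latch, byte_enables, next_index)."""
    base = entries[index].addr & ~0x7
    latch = 0    # models write gather latch 165
    enables = 0  # which byte lanes of the latch hold gathered data
    while index < len(entries) and entries[index].av and entries[index].mrg:
        e = entries[index]
        lane = e.addr & 0x7
        operand = e.data & ((1 << (8 * e.size)) - 1)
        latch |= operand << (8 * lane)              # shift into its byte lanes
        enables |= ((1 << e.size) - 1) << lane      # leave other lanes untouched
        index += 1
    return base, latch, enables, index
```

With two one-byte writes at 0x2000 and 0x2001 followed by a two-byte write at 0x2002, the three entries are presented as a single access to the group at 0x2000 carrying four valid byte lanes.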

According to the preferred embodiment of the invention, therefore, the efficiency of retiring data to cache
or to memory is much improved by allowing for single memory accesses to accomplish the write operation in
lieu of multiple accesses to contiguous memory locations.

7. Conclusion 

According to the preferred embodiment of the invention, a write buffer is provided between the CPU core
and the memory system (including cache memory) to provide buffering of the results of the executed instruction
sequence. This enables the cache and memory reads to be performed on a high priority basis with minimum
wait states due to non-time-critical write operations that may be occupying the buses or memory systems.

In addition, the preferred embodiment of the invention includes many features that are particularly beneficial
for specific microprocessor architectures. Such features include the provision of two sections of the write
buffer for superscalar processors, together with a technique for ensuring that the data is written to memory in
program order despite the splitting of the buffer. Additional features of the preferred embodiment of the invention
include the detection and handling of hazards such as data dependencies and exceptions, and provision
for speculative execution of instructions with rapid and accurate flushing of the write buffer in the event of an
unsuccessful prediction.

While the invention has been described herein relative to its preferred embodiments, it is of course contemplated
that modifications of, and alternatives to, these embodiments, such modifications and alternatives
obtaining the advantages and benefits of this invention, will be apparent to those of ordinary skill in the art
having reference to this specification and its drawings. It is contemplated that such modifications and alternatives
are encompassed within the scope of this invention.

Claims

1. A microprocessor comprising:

(a) core means for processing data according to operations defined by a sequence of instructions;

(b) a write buffer having a plurality of entries coupled to the core means;

(c) a cache memory having a plurality of memory locations coupled to the write buffer and the core means;

(d) a bus coupled to the core means, the write buffer, and the cache memory;

(e) control logic coupled to the bus (d) to detect whether an instruction is a misaligned write instruction;

(f) a shifter coupled to the write buffer to shift the contents of a first entry of the plurality of entries detected
as a misaligned write instruction by the control logic, prior to presentation of the first entry to the cache
memory; and

(g) a misaligned write latch coupled to the shifter and to the cache memory to latch the shifted contents
of the first entry of the plurality of entries and to present the data corresponding to the misaligned write
to the cache memory in first and second write cycles.

2. The microprocessor of claim 1, wherein each of the plurality of entries in the write buffer includes a misaligned
write control bit that is set by the control logic in response to detection that an instruction is a misaligned
write instruction.

3. The microprocessor of claim 2, wherein the control logic further sets the misaligned write control bit in
all of the plurality of entries responsive to detecting the misaligned write instruction.

4. The microprocessor of claim 3, wherein the control logic further loads the first entry of the plurality of
entries with a physical address in response to the write operation being detected as a misaligned write, and
further loads a second entry of the plurality of entries with a higher order physical address from that stored
in the first entry of the plurality of entries to serve as the address for the second write cycle.

5. In a pipelined microprocessor, a method of buffering results of operations executed by a central processing
unit core according to a series of instructions, such buffering being effected in a write buffer having a
plurality of entries and effected prior to storage in a cache memory, comprising the steps of:

(a) identifying whether an instruction is a misaligned write operation;

(b) determining a first physical memory address of a first portion to which results of the misaligned write
operation identified in step (a) are to be written;

(c) storing the first physical memory address in a first write buffer entry;

(d) determining a second physical memory address of a second portion to which results of the misaligned
write operation identified in step (a) are to be written;

(e) executing the misaligned write operation of step (a);

(f) storing the first and second portions of the operation results from step (e) in the first write buffer entry;

(g) latching the first and second portions of the operation results from the first write buffer entry into a
latch;

(h) presenting the first physical address and the first latched portion of the operation results to the cache
memory in a first write cycle; and

(i) presenting the second physical address and the second latched portion of the operation results to the
cache memory in a second write cycle.

6. The method of claim 5, wherein the first physical address corresponds to a lower order address than 
the second physical address. 

7. The method of claim 6, further comprising, prior to step (g), step (j) shifting the first and second portions
of the instruction results so that the first portion of the instruction results resides in higher order byte positions
than the second portion of the instruction results.

8. The method of claim 5, further comprising step (k), responsive to step (a), setting a misaligned write 
control bit in the first write buffer entry to indicate that the write operation thereto will be a misaligned write. 

9. The method of claim 8, wherein step (g) is performed responsive to the misaligned write control bit being
set in the first write buffer entry.

10. A microprocessor comprising:

(a) central processing means for processing data according to operations defined by a first type of program
instructions;

(b) secondary processing means for processing data according to operations defined by a second type
of program instructions, the secondary processing means providing results having a data word greater in
bit width than that provided by the central processing means;

(c) a write buffer having a plurality of buffer entries coupled to the central processing means, each entry
including a data portion to receive operation results and an address portion to store physical memory addresses
at which results are stored;

(d) a cache memory having a plurality of memory locations coupled to the write buffer to receive data therefrom
and coupled to the central processing means to present data thereto;

(e) a bus coupled to the central processing means, the secondary processing means, the write buffer, and
the cache memory;

(f) secondary data latch means for storing results of the secondary processing means that are written to
memory at a physical address stored in a buffer entry; and

(g) routing means for routing the data portion of the write buffer entry or contents of the secondary data
latch means, to the cache memory.

11. A method of buffering results of data processing operations executed by a central processing unit and
a secondary processing unit in a microprocessor, prior to storage of results in a cache memory of the microprocessor
through use of a write buffer having a plurality of write buffer entries, each having a data portion
and an address portion, where results of the secondary processing unit correspond to data words of greater
bit width than that of the data portion of the write buffer entries, comprising the steps of:

(a) determining a first memory address to store results of a first instruction;

(b) storing the first memory address determined in step (a) in the address portion of a first write buffer
entry;

(c) executing a first instruction with either the central processing unit or the secondary processing unit;

(d) responsive to the secondary processing unit executing the first instruction in step (c), storing results 
in a secondary data latch having a wider bit width than the data portion of the plurality of write buffer entries; 

(e) responsive to the central processing unit executing the first instruction in step (c), storing the results
of the first instruction in the data portion of the first write buffer entry; 

(f) retrieving results of the first instruction from the write buffer to store in the cache memory by selecting
the contents of the secondary data latch if the first instruction was executed by the secondary processing
unit, or selecting the contents of the data portion of the first write buffer entry if the first instruction was
executed by the central processing unit; and

(g) presenting the contents selected in step (f) to the cache memory in combination with the first physical 
address stored in the first write buffer entry. 
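
By way of illustration only, the selection recited in steps (d) through (g) of claim 11 can be modeled in software as a two-way multiplexer between the secondary data latch and the data portion of the entry. The sketch below is not the claimed circuit: the widths (`DATA_BITS`, `LATCH_BITS`), the `from_secondary` flag, the one-element list used as the latch, and the `Cache` class are all assumptions introduced for the example.

```python
from dataclasses import dataclass

DATA_BITS = 32    # assumed width of a write buffer data portion
LATCH_BITS = 64   # assumed (wider) width of the secondary data latch


@dataclass
class WriteBufferEntry:
    address: int = 0
    data: int = 0
    from_secondary: bool = False  # hypothetical flag: result is held in the latch


class Cache:
    """Toy cache: a dictionary keyed by physical address."""

    def __init__(self):
        self.mem = {}

    def write(self, address, value):
        self.mem[address] = value


def buffer_result(entry, latch, address, result, by_secondary):
    """Steps (a)-(e): record the address, then store the result either in the
    secondary data latch or in the data portion of the entry."""
    entry.address = address
    entry.from_secondary = by_secondary
    if by_secondary:
        latch[0] = result & ((1 << LATCH_BITS) - 1)
    else:
        entry.data = result & ((1 << DATA_BITS) - 1)


def retire(entry, latch, cache):
    """Steps (f)-(g): select the source and present it with the stored address."""
    value = latch[0] if entry.from_secondary else entry.data
    cache.write(entry.address, value)
    return value
```

In this model a result from the secondary unit never passes through the narrower data portion; only the address is held in the write buffer entry.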

12. A microprocessor comprising: 

(a) central processing means for processing data according to operations defined by instructions to be
executed in a program order;

(b) a write buffer including a plurality of buffer entries arranged in first and second sections, coupled to 
the central processing means; 

(c) a cache memory having a plurality of memory locations coupled to the write buffer and the central processing
means;

(d) a bus coupled to the central processing means, the write buffer, and the cache memory; and,

(e) control logic means for controlling the write buffer so that instruction results stored therein are pre- 
sented to the cache memory in program order. 

13. In a microprocessor, a method of buffering results of data processing operations executed by a central
processing unit core according to a series of instructions in a program order, such buffering being prior to storage
in a cache memory of the microprocessor, comprising the steps of:

(a) for a plurality of instructions, determining a physical address to which instruction results are to be writ- 
ten; 

(b) for each physical address determined in step (a), storing the determined physical address into one of 
a plurality of write buffer entries, arranged into first and second sections; 

(c) executing the instructions;

(d) storing results into the write buffer entries in which is stored the physical address for the instruction; 
and 

(e) retrieving, in the program order, the results from the write buffer entries in step (d), for storage in the 
cache memory at a location associated with the stored memory address. 
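
By way of illustration only, the program-order retrieval of step (e) across two sections can be sketched with a per-entry cross-dependency mask, in the spirit of the cross-dependency tables mentioned in the abstract. The sketch assumes that each entry, when allocated, records which entries of the opposite section are still pending, and that an entry may retire only when it is the oldest in its own section and its mask is clear. The class and method names, the dictionary fields, and the section size are hypothetical.

```python
class TwoSectionWriteBuffer:
    def __init__(self, entries_per_section=4):
        self.n = entries_per_section
        self.sections = [[None] * self.n, [None] * self.n]
        self.head = [0, 0]   # oldest entry in each circular section
        self.tail = [0, 0]   # next free slot in each circular section
        self.count = [0, 0]

    def allocate(self, section, address):
        """Step (b): store the address; snapshot older entries of the other section."""
        if self.count[section] == self.n:
            raise BufferError("section full")
        other = 1 - section
        xdep = 0
        for i, entry in enumerate(self.sections[other]):
            if entry is not None:
                xdep |= 1 << i   # this older entry must retire first
        slot = self.tail[section]
        self.sections[section][slot] = {"address": address, "data": None, "xdep": xdep}
        self.tail[section] = (slot + 1) % self.n
        self.count[section] += 1
        return slot

    def store_result(self, section, slot, data):
        """Step (d): store the result next to its address."""
        self.sections[section][slot]["data"] = data

    def retire_next(self, cache):
        """Step (e): retire the globally oldest entry, if its result is ready."""
        for s in (0, 1):
            if self.count[s] == 0:
                continue
            slot = self.head[s]
            entry = self.sections[s][slot]
            if entry["xdep"] == 0 and entry["data"] is not None:
                cache[entry["address"]] = entry["data"]
                self.sections[s][slot] = None
                self.head[s] = (slot + 1) % self.n
                self.count[s] -= 1
                # Release entries of the opposite section that waited on this one.
                for waiting in self.sections[1 - s]:
                    if waiting is not None:
                        waiting["xdep"] &= ~(1 << slot)
                return entry["address"]
        return None
```

Because a mask only ever names entries that were allocated earlier, the oldest pending entry always has a clear mask, so results written out of order still reach the cache in program order.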

14. In a microprocessor, a method of buffering results of data processing operations executed by a central
processing unit core according to a series of instructions in a program order, the instructions including writes 
to memory and non-cacheable reads from memory, such buffering being prior to storage in a cache memory 
of the microprocessor, comprising the steps of: 

(a) determining a physical address to which instruction results are to be written in memory;

(b) storing the physical address determined in step (a) into one of a plurality of write buffer entries and
setting an address valid control bit in the write buffer entry in which the physical address is stored;

(c) determining the physical address of the memory location from which the non-cacheable read is to be
accessed for each non-cacheable read instruction;

(d) loading a non-cacheable read dependency field having a plurality of bit positions, each bit position corresponding
to one of the write buffer entries, with the state of the address valid control bits for the corresponding
write buffer entry;

(e) executing the series of instructions; 

(f) storing the results into the write buffer entry in which is stored the physical address for the instruction;

(g) retrieving, in program order, the stored results from the write buffer entries to store in the cache memory 
at a location associated with the stored memory address and clearing the bit in the non-cacheable read 
dependency field corresponding to the retrieved write buffer entry; and 

(h) performing the non-cacheable read, responsive to the non-cacheable read dependency field being
clear.
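
By way of illustration only, steps (b), (d), (g) and (h) of claim 14 can be modeled with one bit per write buffer entry. The sketch assumes a single outstanding non-cacheable read and a simple list that tracks program order; the class name, the method names, and the dictionary used as memory are hypothetical.

```python
class NonCacheableReadOrdering:
    def __init__(self, num_entries=8):
        self.address = [None] * num_entries
        self.data = [None] * num_entries
        self.addr_valid = [False] * num_entries
        self.program_order = []      # hypothetical bookkeeping of allocated slots
        self.nc_dependency = 0       # one bit per write buffer entry
        self.nc_read_address = None

    def allocate_write(self, slot, address):
        """Step (b): store the address and set the address valid bit."""
        self.address[slot] = address
        self.addr_valid[slot] = True
        self.program_order.append(slot)

    def request_nc_read(self, address):
        """Steps (c)-(d): snapshot the address valid bits into the field."""
        self.nc_read_address = address
        self.nc_dependency = 0
        for slot, valid in enumerate(self.addr_valid):
            if valid:
                self.nc_dependency |= 1 << slot

    def store_result(self, slot, data):
        """Step (f): store the result in the entry holding its address."""
        self.data[slot] = data

    def retire_oldest(self, memory):
        """Step (g): write the oldest entry out and clear its dependency bit."""
        if not self.program_order:
            return False
        slot = self.program_order[0]
        if self.data[slot] is None:
            return False
        memory[self.address[slot]] = self.data[slot]
        self.program_order.pop(0)
        self.addr_valid[slot] = False
        self.data[slot] = None
        self.nc_dependency &= ~(1 << slot)
        return True

    def try_nc_read(self, memory):
        """Step (h): perform the read only once the field is clear."""
        if self.nc_read_address is None or self.nc_dependency != 0:
            return (False, None)
        value = memory.get(self.nc_read_address)
        self.nc_read_address = None
        return (True, value)
```

Writes allocated after the snapshot do not appear in the field, so in this model the read waits only for writes that precede it in program order.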

15. A microprocessor comprising: 

(a) central processing means for processing data according to operations defined by a sequence of in- 
structions; 

(b) a write buffer having a plurality of buffer entries coupled to the central processing means; 

(c) a cache memory having a plurality of memory locations, coupled to the write buffer and to the central
processing means; 

(d) a bus coupled to the central processing means, the write buffer, and the cache memory; 

(e) control logic means for detecting that first and second instructions include memory writes to addresses 
in the same byte group; and 

(f) gathered write latch means for storing a data portion of the first and second instructions responsive to
the control logic means, and for presenting its contents to the memory cache in a single write cycle. 

16. In a pipelined microprocessor, a method of buffering results of data processing operations executed
by a central processing unit core according to a series of instructions, such buffering being effected in a write
buffer having a plurality of write buffer entries and effected prior to storage in a cache memory of the microprocessor,
where data is communicated in a byte group from the write buffer to the cache memory in a write
cycle, comprising the steps of:

(a) detecting that first and second instructions include memory writes to addresses in the same byte group; 

(b) determining first and second physical memory addresses to which results of the instruction are to be 
written in memory; 

(c) storing the first and second physical addresses in first and second write buffer entries, respectively;

(d) executing the first and second instructions; 

(e) storing the results of the first and second instructions in the first and second write buffer entries, re- 
spectively; 

(f) latching the results from step (e); and, 

(g) presenting a physical address corresponding to the byte group of the first and second physical addresses
and the results latched in step (f) to the cache memory in a write cycle.
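
By way of illustration only, the gathering of steps (a), (f) and (g) of claim 16 can be sketched as merging two writes into one latch with per-byte enables. The sketch assumes an aligned eight-byte group (`GROUP_BYTES`); the function names and the byte-enable representation are hypothetical and are not taken from the claims.

```python
GROUP_BYTES = 8  # assumed size of the byte group moved in one write cycle


def same_byte_group(addr_a, addr_b):
    """Step (a): do both writes fall in the same aligned byte group?"""
    return addr_a // GROUP_BYTES == addr_b // GROUP_BYTES


def gather(writes):
    """Step (f): latch results of writes (address, payload) given in program order.

    Returns the byte group address, the latch contents and a byte-enable mask.
    """
    base = (writes[0][0] // GROUP_BYTES) * GROUP_BYTES
    latch = bytearray(GROUP_BYTES)
    enables = 0
    for address, payload in writes:
        offset = address - base
        if offset < 0 or offset + len(payload) > GROUP_BYTES:
            raise ValueError("write does not fit in the byte group")
        latch[offset:offset + len(payload)] = payload  # later writes win
        for i in range(len(payload)):
            enables |= 1 << (offset + i)
    return base, bytes(latch), enables


def write_cycle(memory, base, latch, enables):
    """Step (g): present the group address and latched bytes in one cycle."""
    for i in range(GROUP_BYTES):
        if (enables >> i) & 1:
            memory[base + i] = latch[i]
```

For example, two two-byte writes to 0x104 and 0x106 would be presented as a single write to group address 0x100 with the upper four byte enables set.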

17. A microprocessor comprising: 

(a) central processing pipeline means for processing data according to operations defined by program instructions
so that a writeback stage and an address calculation stage of a first and a second program instruction,
respectively, are processed substantially simultaneously;

(b) a write buffer including a plurality of buffer entries coupled to the central processing pipeline means; 

(c) a cache memory having a plurality of memory locations coupled to the write buffer and to the central 
processing means; 

(d) a bus coupled to the central processing pipeline means, the write buffer, and the cache memory; and

(e) control logic means for comparing a physical address of a read operation requested by the second instruction
in the address calculation stage to addresses associated with each of the plurality of buffer entries
to detect a read-after-write data dependency between the first and second instructions.

18. A method of buffering results of data processing operations executed by central processing unit core 
in a microprocessor prior to storage in a cache memory of the microprocessor, comprising the steps of: 

(a) determining a first memory address for storage of results of a first instruction;

(b) storing the first memory address determined in step (a) in a first entry of a plurality of write buffer en- 
tries; 

(c) determining a second memory address from which data is to be read for a second instruction, the sec- 
ond instruction being later in program order than the first instruction; 

(d) comparing the second memory address determined in step (c) against the first memory address stored
in the first entry of the plurality of write buffer entries to detect a match; 

(e) executing a first instruction to produce a first result; 

(f) storing the first result in the first write buffer entry; and,

(g) retrieving the first result from the first write buffer entry for storage in the cache memory at a location
associated with the first memory address.
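
By way of illustration only, the comparison of step (d) of claim 18 can be sketched as a match of the read address against the address field of every valid write buffer entry. The sketch assumes an exact address match and ignores operand size and partial overlap; the dictionary fields and the function name are hypothetical.

```python
def detect_read_after_write(entries, read_address):
    """Return the indices of valid entries whose stored address matches the read.

    A non-empty result indicates a read-after-write dependency: the read must
    not be satisfied from the cache until the matching write has been retired.
    """
    return [
        index
        for index, entry in enumerate(entries)
        if entry["valid"] and entry["address"] == read_address
    ]


# Step (b) has already stored addresses for two pending writes; the third
# entry is unused.
pending = [
    {"valid": True, "address": 0x1000},
    {"valid": True, "address": 0x2000},
    {"valid": False, "address": 0x2000},
]
matches = detect_read_after_write(pending, 0x2000)
```

Only the second entry is reported here, since the third entry holds a stale address with its valid bit clear.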

19. A microprocessor comprising:

(a) central processing means for processing data according to operations defined by instructions for execution
in a program order, wherein at least one of the instructions is of a conditional branch type;

(b) a write buffer including a plurality of buffer entries coupled to the central processing unit core to receive
data therefrom corresponding to results of the instructions, each of the plurality of buffer entries including
at least one speculation control bit indicating, when set, that data to be written to its buffer entry is from
execution of an instruction in a predicted branch in program sequence after an instruction of the conditional
branch type;

(c) a cache memory having a plurality of memory locations coupled to the write buffer and to the central 
processing means; 

(d) a bus coupled to the central processing means, the write buffer, and the cache memory; and, 

(e) control logic for controlling the presentation of data by the write buffer to the cache memory so that
each write buffer entry presents data to the cache memory only if the speculation control bit is not set.

20. A method of buffering results of data processing operations executed by a central processing unit core 
of a pipelined microprocessor according to a series of instructions including at least one conditional branch 
instruction, such buffering being effected in a write buffer having a plurality of write buffer entries and effected 
prior to storage in a cache memory of the microprocessor, comprising the steps of: 

(a) detecting a conditional branch instruction;

(b) predicting a first sequence of instructions to be executed prior to determining the state of a condition 
upon which step (a) depends; 

(c) for an instruction in step (b) corresponding to a write to memory, determining a first physical memory
address to which results are to be written in memory;

(d) storing the first physical address in a first write buffer entry;

(e) executing the write to memory instruction predicted in step (b); 

(f) storing the results of step (e) in the first write buffer entry;

(g) determining the condition upon which the conditional branch instruction depends; 

(h) responsive to step (g), indicating that step (b) was correct, and 

(i) retrieving the results of the first write buffer entry for storage in memory.
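
By way of illustration only, the gating of a speculative write as in claims 19 and 20 can be sketched with one speculation control bit per entry. The sketch assumes a single level of speculation. Discarding speculative entries on a misprediction is likewise an assumption of the sketch, since claim 20 recites only the correctly predicted case. All names are hypothetical.

```python
class SpeculativeWriteBuffer:
    def __init__(self):
        self.entries = []  # kept in program order

    def allocate(self, address, speculative):
        """Steps (c)-(d): store the address; mark writes on a predicted path."""
        entry = {"address": address, "data": None, "speculative": speculative}
        self.entries.append(entry)
        return entry

    def resolve_branch(self, predicted_correctly):
        """Steps (g)-(h): clear the speculation bits once the branch resolves."""
        if predicted_correctly:
            for entry in self.entries:
                entry["speculative"] = False
        else:
            # Assumption: results from the wrong path are simply dropped.
            self.entries = [e for e in self.entries if not e["speculative"]]

    def retire(self, memory):
        """Step (i): write out entries, stopping at the first one that is
        still speculative or whose result has not yet been stored."""
        retired = []
        while self.entries:
            entry = self.entries[0]
            if entry["speculative"] or entry["data"] is None:
                break
            memory[entry["address"]] = entry["data"]
            retired.append(entry["address"])
            self.entries.pop(0)
        return retired
```

In this model a speculative entry can hold both its address and its result for any number of cycles, yet memory is not modified until the condition has been determined.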

21. A method of handling exception conditions in a pipelined microprocessor having a central processing 
unit core to execute data processing operations according to a series of instructions, the microprocessor in- 
cluding a write buffer having a plurality of write buffer entries to buffer results of instructions executed by the 
central processing unit core prior to storage in a cache memory, comprising the steps of: 

(a) determining a first memory address in which to store results of a first instruction;

(b) storing the first memory address in a first write buffer entry;

(c) detecting an exception condition prior to execution of the first instruction; and, 

(d) responsive to step (c), invalidating the first write buffer entry. 